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Abstract 

We analyze the convergence behaviour of a recently proposed algorithm for regularized 
estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new in- 
terpretation of DAL as a proximal minimization algorithm. We theoretically show under 
some conditions that DAL converges super-linearly in a non-asymptotic and global sense. 
Due to a special modelling of sparse estimation problems in the context of machine learning, 
the assumptions we make are milder and more natural than those made in conventional 
analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables 
us to generalize DAL to wide varieties of sparse estimation problems. We experimentally 
confirm our analysis in a large scale ^i-regularized logistic regression problem and exten- 
sively compare the efficiency of DAL algorithm to previously proposed algorithms on both 
synthetic and benchmark datasets. 

Running title: Dual Augmented-Lagrangian Converges Super-Linearly 
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1. Introduction 

Sparse estimation through convex regularization has become a common practice in many 
application areas including bioinformatics and natural language processing. However facing 
the rapid increase in the size of data-sets that we analyze everyday, clearly needed is the 
development of optimization algorithms that are tailored for machine learning applications. 

Regularization-based sparse estimation methods estimate unknown variables through 
the minimization of a loss term (or a data-fit term) plus a regularization term. In this 
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paper, we focus on convex methods; i.e., both the loss term and the regularization term 
are convex functions of unknown variables. Regularizers may be non-differentiable on some 
points; the non-differentiability can promote various types of sparsity on the solution. 

Although the problem is convex, there are three fact ors that challenge the straight - 
forward application of general tools for convex optimization ( Bovd and Vandenberghe , 20041 ) 
in the context of machine learning. 

The first factor is the diversity of loss functions. Arguably the squared loss is most 
commonly used in the field of signal/ima ge reconstruction, in which n i any algorithms for 



sparse estimation hav e been developed ( Figueiredo and Nowak . 20031 : Daubechies et al 



20041 : ICai et al.l . |2008| ). However the variety of loss functions is much wider in machine 



learning, to name a few, logistic loss and other log-linear loss functions. Note that these 
functions are not necessarily strongly convex like the squared loss. See Table [3] for a list of 
loss functions that we consider. 

The second factor is the nature of the data matrix, which we call the design matrix 
in this paper. For a regression problem, the design matrix is defined by stacking input 
vectors along rows. If the input vectors are numerical (e.g., gene expression data), the 
design matrix is dense and has no structure. In addition, the characteristics of the matrix 
(e.g., the condition number) is unknown until the data is provided. Therefore, we would 
like to minimize assumptions about the design matrix, such as, sparse, structured, or well 
conditioned. 

The third factor is the large number of unknown variables (or parameters) compared 
to observations. This is a situation regularized estimation methods are commonly applied. 
This factor may have been overlooked in the context of signal denoising, in which the number 
of observations and the number of parameters are equal. 



Va ri ous methods have b een pr o posed for efficient spar s e estim a tion fseelFigueiredo and Nowak 
(I2OO3I): [Paubechies et all (l2004 : Icombettes and Waid dioosl): lAndrew and Gaol (|2007l ): 



Koh et al.l tooii ): IWright et al.l (|2009l ): iBeck and Teboullel toO^j ): IYu et all ^20m ). and 



the references therein). Many previous studies focus on the non- differentiability of the 
regularization term. In contrast, we focus on the couplings between variables (or non- 
separability) caused by the design matrix. In fact, if the optimization problem can be de- 
comp qsed into smaller (e.g ., containing a single variable) problems, optimization is easy. Re- 
cently |Wrighte^.aJjj2009|}_sl^^ so called iterative s hrink a ge /thresholding (1ST) 
metho d (see Figueiredo and Nowak ( 20031 ): Daubechies et al. ( 2004 ): Combettes and Waid 
^200^ ): iFimieiredo et al. l (j2007al )^ can be seen as an iterative separable approximation pro- 
cess. 

In this paper, we show that a recen tly proposed dual augmented Lagrangian (DAL) algo- 
rithm (jTomioka and Sugiyamal . l2009l ^ can be considered a s an exact (up to fin ite tolerance) 
version of the iterative approximation process discussed in I Wright et al.l (120091) . Our formu - 
lation is based on the connection between the p roximal rninimization (Rockafellad. Il976al) 
and the augmented La grangian (AL) algorithm (|Hestened . Il969l : IPowell 11969^ IRockafellari . 
1976bl : iBertsekad . Il982l ). The proximal minimization framework also allows us to rigorously 



study the convergence behaviour of DAL. We show that DAL converges super-linearly under 
some mild conditions, which means that the number of iterations that we need to obtain an 
e-accurate solution grows no greater than logarithmically with 1/e. Due to the generality 
of the framework, our analysis applies to a wide variety of practically important regulariz- 
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ers. Our anal ysis improves the cla ssical result on the convergence of augmented Lagrangian 
algorithms in iRockafellarl (|l976bl ) by taking special structures of sparse est i mation into 
account. In addition, we make no asymptotic arguments as in IRockafellarl (|l976bl ;i and 



Kort and Bertsekad (Il976l): instead our convergence analysis is build on top of the recent 



result in 



Beck and Teboullel JioO^). 



Augmented Lagrangian formulations have also been considered in lYin et al.l (|2008l ) and 



Goldstein and Oshed (l2008l) for sparse s ignal reconstruction. What differentiates DAL ap- 



proach of lTomioka and Sugivam af (|2009l ) from those studied earlier is that the AL algorithm 
is applied to the dual problem (see Section 12. 2p . which results in an inner minimization 
problem that can be solved efficiently exploiting the sparsity of intermediate solutions (see 
Section l4.ip . Applying AL formulation to the dual problem also plays an important role 
in the convergence analysis because some loss fu nctions (e.g., logi s tic los s) are not strongly 
convex in the primal; see Section O Recently lYang and Zhanl (H) compared primal 
and dual augmented Lagrangian algori thms for £i -problems and reported that the dual 
formulation was more efficient. See also Tomioka et al.l ( 2011 ) for related discussions. 

This paper is organized as follows. In Section [2l we mathematically formulate the 
sparse estimation problem and we review DAL algorithm. We derive DAL algorithm from 
the proximal minimization framework in Section [3l and discuss special instances of DAL 
algorithm are discussed in Section [H In Section [5l we theoretically analyze the convergence 
behaviour of DAL algorithm. We discuss previously proposed algorithms in Section [6] and 
contrast them with DAL. In Section [7] we confirm our analysis in a simulated £i-regularized 
logistic regression problem. Moreover, we extensively compare recently proposed algorithms 
for £i-regularized logistic regression including DAL in synthetic and benchmark datasets 
under a variety of conditions. Finally we conclude the paper in Section [HI Most of the 
proofs are given in the appendix. 



2. Sparse estimation problem and DAL algorithm 

In this section, we first formulate the sparse estimation problem as a convex optimization 
problem, and state our assumptions. Next we derive DAL algorithm for £i-problem as an 
augmented Lagrangian method in the dual. 

2.1 Objective 

We consider the problem of estimating an n dimensional parameter vector from m training 
examples as described in the following optimization problem: 



minimize fi{Aw) + (j)x{'w), (1) 



where w S is the parameter vector to be estimated, A S W^^"' is a design matrix, and 
fi{-) is a loss function. We call the first term in the minimand the loss term and the second 
term the regularization term, or the regularizer. 
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We assume that the loss function fi : — t- MU{+oo} is a closed proper strictly convex 
functior0. See Table [3] for examples of loss functions. We assume that has Lipschitz 
continuous gradient with modulus I/7 (see Assumption (A2) in Section [5.20 . If fi is twice 
differentiable, this condition is equivalent to saying that the maximum eigenvalue of the 
Hessian of fi is uniformly bounded by I/7. Such 7 exists for example for quadratic loss, 
logistic loss, and other log-linear losses. However, non-smooth loss functions (e.g., the hinge 
loss and the absolute loss) are excluded. Note that since we separate the data matrix A 
from the loss function, we can quantify the above constant 7 without examining the data. 
Moreover, we assume that the convex conjugat^ is (essentially) twice differentiable. 
Note that the first order different iability of the conv ex conjugate is implied by the strict 
convexity of the loss function fg ( Rockafellar . 197d . Theorem 26.3). 

The regularization term (j)x{w) is a convex possibly non-differentiable function. In ad- 
dition, we assume that for all r/ > 0, 7]cj)x{w) = (prjxiw). 

An important special case, which has been studied by rn any authors ( Tibshirani . 19961 : 
Efron et all . I2OO4I : lAndrew and Gaol . I2OO7I : iKoh et al.1 . 12OO7I ) is the £i-regularization: 



minimize fi{Aw) + A||iu||i, 



(2) 



where \\w\\i = \ wj\ is the £i-norm of w. 



2.2 Dual augmented Lagrangian (DAL) algorithm 



In thi s subsection, we review DAL algorithm following the line of iTomioka and Sugivama 
Although, the squared loss function and the £i-regularizer were considered in the 
original paper, we deal with a slightly more general setting in Eq. ([2|) for notational conve- 
nience; i.e., we consider general closed convex loss functions i nstead o f the s quared lo s s. For 
general informatio n on augmented Lagr angi an algorithms (iPowell Il969l : iHestened . Il969l : 
Rockafellail . Il976bl ) , see iBertsekasI ^WSi ) and iNocedal and WrightJ (|l999l ). 

Let (j)\{w) be the ^i-regu l arizer , i.e., 4>\{w) = X\\w\\i = XJ2j=i l^il- Using the Fenchel 
duality theorem ( Rockafellar . 197d ). the dual of the problem ([2]) can be written as follows: 



maximize 



subject to V = A a, 



(3) 
(4) 



where S'^ is the indicator function ( Rockafellar . 1970l . p28) of the £oo-ball of radius A, namely 



(5) 



where S'^ivj) = 0, if \vj\ < A, and -|-oo otherwise. 

1. "Closed" means that the epigraph {{z,y) £ R"'"'"'^ : y > fe(z)} is a closed set, and "proper" means that 
the function is not everywhere +00; see e.g.. iRockafellaj l|l970l ). In the sequel, we use the word "convex 
function" in the meaning of "closed proper convex function" . 

2. The convex conjugate of a function / : R" — >■ R U {+00} is a function /* over R" that takes values in 
R" U {+00} and is defined as f*{y) = sup^gj|„ (y^a; — f{x)). 
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Let us consider the augmented Lagrangian (AL) function with respect to the above 
dual problem ([3]) 

v; w) = -//(-a) - d^v) + w'^ (v - A^a) - |||^ - A^af, (6) 

where the primal variable w € M"' is interpreted as a Lagrangian multiplier vector in the 
AL framework. Note that the AL function is the ordinary Lagrangian if 77 = 0. 

Let T]Q,r]i,. . . be a non-decreasing sequence of positive numbers. At every time step t, 
given the current primal solution w^, we maximize the AL function L^^{a,v;w^) with re- 
spect to a and v. The maximizer {a*',v^) is used to update the primal solution (Lagrangian 
multiplier) as follows: 

iti*+i = w* + 77t(A"^Q!* -t;*). (7) 

Note that the maximization of the AL function ([6]) with respect to v can be carried out 
in a closed form, because the terms involved in the maximization can be separated into n 
terms, each containing single vj, as follows: 

n 

2 



where denotes the jth element of a vector. Since S'^{vj) is infinity outside the domain 
—A < Vj < A, the maximizer v^(a) is obtained as a projection onto the ioo ball of radius A 
as follows (see also Figure [9]) : 



■u*(a) = proj[_;, (w^/r]t + A'^aj := ( mm {\yj\, X) f^] 

^ ^ \ Wjl/ j=i 



(8) 



where {yj)^^i denotes an n-dimensional vector whose jth element is given by yj. Note that 
the ratio yj/\yj\ is defined to be zercH if yj = 0. Substituting the above back into Eq. ([7]), 
we obtain the following update equation: 

= prox^;^^ (ii;* + r]tA~^(X^), 
where prox^^^^ is called the soft-threshold operatioij^ and is defined as follows: 

prox'^iy) := (^max(|y,| - A,0)^) . (9) 

The soft-threshold operation is we ll known in signal proce s sing c ommunity and ha s been 
studied extensively (Donohol . 1995 : Figueiredo and Nowak . 2003 : Daubechies et al. . 2004 : 



Combettes and Waja . l2005l ). 



3. This is equivalent to defining yj/\yj\ ~ sign(?/j). We use t/j/|j/j| instead of sign(j/j) to define tlie soft- 
tfiresliold operations corresponding to £1 and tlie group-lasso regularizations (see Section |4]2]) in a similar 
way. 

4. This notation is a simplified version of the general notation we introduce later in Eq. (|15|l . 
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Furthermore, substituting the above v*{a) into Eq. ([6]), we can express a* as the mini- 
mizer of the function 

^t{a) := -Lr,,{cy,v\cx)-w') = fl{-cx) + -L||prox^^i^^(^* + oc)\\\ (10) 

which we also call an AL function with a slight abuse of terminology. Note that the maxi- 
mization in Eq. ([6]) is turned into a minimization of the above function by negating the AL 
function. 

3. Proximal minimization view 

The first contribution of this paper is to deri ve DAL algorithm w e reviewed in Section 12.21 
from the proximal minimization framework fao ckafellar . 1976a ). which allows for a new 



interpretation of the algorithm (see Section 13. 3p and rigorous analysis of its convergence 
(see Section [5]). 

3.1 Proximal minimization algorithm 

Let us consider the follow ing iterative algorithm called the proximal minimization algo- 



rithm (|Rockafellaij . Il976al l for the minimization of the objective ([T]). 

1. Choose some initial solution and a sequence of non-decreasing positive numbers 
% < ??i < • • • • 



2. Repe at until some criterion (e.g., duality gap ( Wright et al. . 20091 : Tomioka and Sugiyama . 
iooi)) is satisfied: 



w*-^^ = argmin ( f{w) + —\\w - w^\\'^] , (11) 

where f{w) is the objective function in Eq. ([T]) and 77^ controls the influence of the 
additional 'proximity term. 

The proximity term tries to keep the next solution tt)*"*"^ close to the current solution tu*. 
Im portantly, the objec tive ()lip is strongly convex even if the original objective ([T]) is not; 
see iRockafellail (|l976bl ^. Although at this point it is not clear how we are going to carry 



out the above minimization, by definition we have /(lo*"*"^) + 2^^1110*+^ — < f(w^); i.e., 
provided that the step-size is positive, the function value decreases monotonically at every 
iteration. 

3.2 Iterative shrinkage/thresholding algorithm from the proximal 
minimization framework 



The function to be minimized in Eq. (jlip is strongly convex. However, there seems to be 
no obvious way to minimize Eq. (Ilip . because it is still (possibly) non-differentiable and 
cannot be decomposed into smaller problems because the elements of w are coupled. 

One w ay to make the prox imal minimization algorithm practical is to linearly approxi- 



mate (see 



Wright et al. ( 20091 )) the loss term at the current point 



tu* as 



fe{Aw) ~ h{Aw') + (V/*) {w - w') , (12) 
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where V/^ is a short hand for \7 /^{Aw*). Substituting the above approximation (jl2p into 
the iteration (jlip . we obtain 

= argmin ( {V fe) ^ Aw + cffxiw) + —\\w - w^f] , (13) 

where constant terms are omitted from the right-hand side. Note that because of the 
hnear approximation, there is no couphng between the elements of w. For example, if 
4>\(w) = X\\w\\, the minimand in the right-hand side of the above equation can be separated 
into n terms each containing single wj, which can be separately minimized. 

Rewriting the above up date equation, we obta i n the well-known itera, t ive shrinkage/ 



Kewritmg tne above up date equation, we obta i n tne well-known itera- t ive snrmKage/ 

thresholding (1ST) methocj^ ( Figueiredo and Nowak . 20031 : Daubechies et al. . 2004 : Combettes and 



20051 : iFigueiredo et al.l . l2007al ) . The 1ST iteration can be written as follows: 



w'+^ := prox. (w' - r^^A^V/i , (14) 



where the proximity operator prox^^^ is defined as follows: 

1.. .,2 



prox^^ (2/) = argmin -||y - x\\ + (j)x{x) . (15) 

Note that the soft-threshold operation prox^^ ([9]) is the proximity operator corresponding 
to the £i-regularizer (j)^-^{w) = A||iu||i. 

3.3 DAL algorithm from the proximal minimization framework 

The above 1ST approach can be considered to be constructing a linear lower bound of the 
loss term in Eq. pip at the current point lo*. In this subsection we show that we can 
precisely (to finite precision) minimize Eq. (jlip using a parametrized linear lower bound 
that can be adjusted to be the tightest at the next point Our approach is based on 

the convexity of the loss function fi. First note that we can rewrite the loss function fi as 
a point-wise maximum as follows: 

fi{Aw) = max ({-ay Aw - /;(-«)) , (16) 



where is the convex conjugate functions of fg. Now we substitute this expression into 
the iteration (lllh as follows: 



w*"*"^ = argmin max \ —(x^ Aw — fU—a) + (p\{w) H \\w — w^\\'^ )■ . (17) 



Note that now the loss term is expressed as a linear function as in the 1ST approach; 
see Eq. (|13p . Now we exchange the order of minimization and maximization because the 



5. It is also known as the forward-backward splitting method ([Lions and Mercieil . 1 19791 : 
ICombettes and Waid . [20051 : iDuchi and Singer] . 120091 '): see Section [Sj 
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function to be minimaxed in E g. (1171) is a s addle function (i.e., convex with respect to w 



and concave with respect to a. (|Rockafellarl . ll97d l). as follows: 

min max < —cyJ Aw — f«(—a) + d>x(w) H \\w — 

= max I -f/(-Q:) + min ( -<x^ Aw + (j)x(w) + —\\w - w*\\'^]\ . (18) 

Notice the similarity between the two minimizations (jl3p and (jlSp (with fixed o;). 

The minimization with respect to w in Eq. (jl8p gives the following update equation 

w'+^ = prox^,^^ (w' + 7?tA^a*) , (19) 

where ct* denotes the maximizer with respect to a in Eq. (Iisp . Note that ct* is in general 
different from — V/^ used in the 1ST approach (|14p . Actually, we show below that ct* = 
-V/*+^ if the max-min problem (llSp is solved exactly. Therefore taking ct* = —Vfj can 
be considered as a naive approximation to this. 

The final step to derive DAL algorithm is to compute the maximizer a* in Eq. (llSp . 
This step is slightly involved and the derivation is presented in Appendix |Bj The result of 
the derivation can be written as follows (notice that the maximization in Eq. p8p is turned 
into a minimization by reversing the sign): 

a* = argmin(/;(-a) + -n^A^' + mA^a)) , (20) 



where the function is called the Moreau envelope of cp*^ (see iMoreaul (jl965l ) : iRockafellar 
and is defined as follows: 



^IM = mm (^4>l{x) + \\\x- w\\^^ . (21) 

We call the function ftiot) in Eq. ()20p the augmented Lagrangian (AL) function. 

What we need to do at every iteration is to minimize the AL function ftiot) and update 
the Lagrangian multiplier w^ as in Eq. ()19p using the minimizer ct* in Eq. (I20p . Of course 
in practice we would like to stop the inner minimization at a finite tolerance. We discuss 
the stopping condition in Section O 

The algorithm we derived above is indeed a generalization of DAL algorithm we reviewed 
in Section [2T2I This can be shown by computing the proximity operator (jl9p and the Moreau 
envelope (f2T]) for the specific case of ^i-regularization; see Section 14.11 and also Table HI 

The AL function (pt{ot) is continuously differentiable, because the AL function is a sum of 

(differentiable by assumption) and an envelope function (differentiable; see Appendix 
In fact, using Lemma [TO] in AppendixjAj the derivative of the AL function can be evaluated 
as follows: 

Viptia) = -Vfti-oc) + Aw'+\cx), (22) 
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Figure 1: Comparison of the lower bounds used in 1ST and DAL. 



where w^^^{a) := prox^^^ (w^ + r]tA^ . The expression for the second derivative depends 
on the particular regularizer chosen. 

Notice again that the above update equation ()19p is very similar to the one in the 
1ST approach (Eq. ()14p ). However, —a, which is the slope of the lower-bound (jl6p is 
optimized in the inner minimization ()20p so that the lower-bound is the tightest at the 
next point w'^^^. In fact, if Vipt{a) = then Vf£{Aw^^^) = — ct* because of Eq. (j22]) 
and V/^(V/^*(— Q*)) = —a*. The difference between the strategies used in 1ST and DAL 
to construct a lower-bound is highlighted in Figure [TJ 1ST uses a fixed gradient-based 
lower-bound which is tightest at the current solution w*, whereas DAL uses a variational 
lower-bound, which can be adjusted to become tightest at the next solution w^~^^. 

The general connection between the augmented Lagrangian algorithm and t he proximal 



minimi z ation algor i thm, and (asymptotic) convergence results can be found in iRockafellar 



(|l976bl ;i: lBCTtsekasl h9Hi ). The derivation we show above is a special case when the objective 



function f{w) can be split into a part that is easy to handle (regularization term (f)x{'w)) 
and the rest (loss term /({Aw)). 



4. Exemplary instances 

In this section, we discuss special instances of DAL framework presented in Section [3] and 
qualitatively discuss the efficiency of minimizing the inner objective. We first discuss the 
simple case of £i-regularization (Section l4.ip . and then group-lasso (Section l4.2p and other 
more general regularization using the so-called support functions (Section 14. 3p . In addition, 
the case of component- wise regularization is discussed in Section [4.41 See also Tabled] for 
a list of regularizers. 
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Algorithm 1 DAL algorithm for £i-regularization 



1: Input: design matrix A, loss function /£, regularization constant A, sequence of prox- 
imity parameters ijt {t = 0, 1, 2, . . .), initial solution w^, tolerance e. 



Set t = 0. 
repeat 

Minimize the augmented Lagrangian function ^ti^t) (see Eq. (jlOp ) with the gradient 
and Hessian given in Eqs. ()24p and (|25p. respectively, using Newton's method. Let 
Q* be the approximate minimizer 



a ~ arg 



rgmin f (-a) + prox^L {w^ + VtA^o^) 
with the stopping criterion (see Section r5.2p 



where 'ft{oi-) is the derivative of the inner objective ([2l 

Update := prox^i^^(w* + r/tA^O!*), t ^ t + 1. 
until relative duality gap (see Section [7.1.20 is less than the tolerance e. 
Output: the final solution w^. 



4.1 Dual augmented Lagrangian algorithm for ^i-regularization 

A||ij;||i, the update equation (fT9|) can be rewritten as 



For the -regularization, 
follows: 



w 



t+1 



prox^i (w' + ryiA'^Q;* 



(23) 



where prox^^ is the proximity operator corresponding to the ^i-regularizer defined in Eq. Q. 
Moreover, noticing that the convex conjugate of the £i-regularizer is the indicator function 
6^ in Eq. ([5]), we can derive the envelope function in Eq. (|2ip as follows (see also 
Figure ED: 



Therefore, the AL function (jlOp in lTomioka and Sugivama (|2009l ) is derived from the prox- 
imal minimization framework (see Eq. (|20p ) in Section [3j 

We use Newton's method for the minimization of the inner objective ipt{a). The overall 
algorithm is shown in A lgorithm [H The gradient and Hessian of the AL function (|10p can 
be evaluated as follows (jTomioka and Sugivamal . I2OO9I I: 

Viftia) = -V/;(-a) + Aw'+\a), (24) 
V\t{a) = V2/;(-a) + VtA+A+^, (25) 

where w^~^^(a) := prox^^^^(iu* + rjtA^a), and A+ is the matrix that consists of columns of 
A that corresponds to "active" variables (i.e., the non-zero elements of w^'^^{a) ). Note 
that Eq. ()24p equals the general expression ()22p from the proximal minimization framework. 
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It is worth noting that in both the computation of matrix-vector product in Eq. (p^ 
and the computation of matrix- matrix product in Eq. (j25p . the cost is only proportional 
to the number of non-zero elements of w^~^^{a). Thus when we are aiming for a sparse 
solution, the minimization of Eq. (|10p can be performed efficiently. 



4.2 Group lasso 

Let 4>x be the group-lasso penalty ( Yuan and Lin . 20061 ). i.e., 



c^fiw) = X^\\w,\\, (26) 
ge© 

where © is a disjoint partition of the index set {1, . . . ,n}, and Wq G M'^I is a sub-vector 
of w that consists of rows of w indicated by g C {l,...,n}. The proximity operator 
corresponding to the group-lasso regularizer is obtained as follows: 

proxf (y) := prox^«(y) = (maxdlyj - A,0)^) , (27) 

where similarly to Eq. ([9]), (?/g)ge© denotes an n-dimensional vector whose g component is 
given by y^. Moreover, analogous to update equation ([23]) (see also Eq. ([TO]) ) in the ^i-case, 
the update equations can be written as follows: 

= proxf^^ (w^ + 7?t A^a' 

where a* is the minimizer of the AL function 

iptia) = m-oc) + 2_||proxf^^(^* + rjtA'cx)f. (28) 

The overall algorithm is obtained by replacing the soft-thresholding operations in Algo- 
rithm [1] by the one defined above (j27p . In addition, the gradient and Hessian of the AL 
function ^t{oi-) can be written as follows: 

V^t{cy) = -V/;(-a) + Aw'+\cx), (29) 

VVt(«) = VV;(-«) + r?, E ((l - + ^~q,qA A/, (30) 

ge©+ \\%\\J ll^gll / 

where i(;*+^(a) = prox®^^(i»;* -|- rjtA^ ex), and is a subset of & that consists of active 
groups, namely := {g G : ||iUg^"'^(Q!)|| > 0}; Ag is a sub-matrix of A that consists 
of columns of A that corresponds to the index-set g; /|g| is the |g| x |g| identity matrix; 
the vector q G M" is defined as q := + rjtA^ a and q^ := qg/Wq^W, where q^ is defined 
analogously to Wg. Note that in the above expression, Ar/t/||gg|| < 1 for g G (5~^ by the 
soft-threshold operation (p7|) . 

Similarly to the £i-case in the last subsection, the sparsity of it)*^^(a) (i.e., \&^\ ^ |(S|) 
can be exploited to efficiently compute the gradient (j29|) and the Hessian (pOj) . 
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4.3 Support functions 

The ^i-norm regularization and the group lasso regularization in Eq. (j26p can be generahzed 
to the class of support functions. The support function of a convex set Cx is defined as 
follows: 

(t)x{x) = sup x^y. (31) 



For example, the £i-norm is the support function of the ^oo unit ball (see Rockafellar ( 1970l )) 



and the group lasso regularizer is the support function of the group-generalized £oo-ball 
defined as {y E R" : Wy^W < A, Vg E &}. It is well know n that the convex conjugate of the 
support function (f3T]) is the indicator function of C (see Rockafellar ( 197d )). namely, 



I +00 (otherwise) . 

The proximity operator corresponding to the support function (j3ip can be written as follows: 

prox™P(y) :=y- proj(;^(y), 

where proj(^^ is the projection onto Cx; see Lemma[8]in Appendix|Al Finally, by computing 
the Moreau envelope ([2T]) corresponding to the above (p*^, we have 

^i(a) = /;(-a) + ^||prox^"P^(«;* + r?,ATa)f , (33) 

where we used the fact that for the indicator function in Eq. (|32p . (j)*^{proi(^^{z)) = (V2;) 
and Lemma El Note that letting prox^^^ = prox^^ and prox^^^ = prox® in Eq. ([55]) , we 
obtain Eq. and Eq. (j28]) . respectively. 



4.4 Handling different regularization constant for each component 

The £i-regularizer in Section [4. 11 and the group lasso regularizer in Section [4.21 assume that 
all the components (variables or groups) are regularized by the same constant A. However 
the general formulation in Section 13.31 allows using different regularization constant for each 
component. 

For example, let us consider the following regularizer: 

n 

i=i 

where Aj > (j = 1, . . . , n). Note that we can also include unregularized terms (e.g., a bias 
term) by setting the corresponding regularization constant Xj = 0. The soft-thresholding 
operation corresponding to the regularizer (134p is written as follows: 

prox^^ (y) = ( max(|yj| - Xj,0)-^ ] 

v \yj\/j=i 
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where again the ratio yj/\yj\ is defined to be zero if yj = 0. Note that if \j = 0, the soft- 
thresholding operation is an identity mapping for that component. Moreover, by noticing 
that the regularizer ([M|l is a support function (see Section . the envelope function 
in Eq. ([2T|) is written as follows: 

1 " 

which can also be derived by noticing that $^(0) = and V<I>^(y) = prox^^(y) (Lemma [TOl 
in Appendix 

As a concrete example, let h be an unregularized bias term and let us assume that all 
the components of w G M" are regularized by the same regularization constant A. In other 
words, we aim to solve the following optimization problem: 

minimize fAAw + \rnb) + \\\w\\i, 

where \\w\\i is the ^i-norm of w, and 1^ is an m-dimensional all one vector. The update 
equations (|19p and (j20p can be written as follows: 

w'+^ = woy^%^ (w' + 7?t A^Q*) , (35) 
= 6* + 7?jl„Ta*, (36) 

where ot* is the minimizer of the AL function as follows: 

= argmin f //(-a) + -3- (||prox^4 (i^* + cx)f + (6* + r^tlj cxf)] . (37) 



5. Analysis 



In this section, we first show the convergence of DAL algorithm assuming that the inner min- 
imization problem (1201) is solved exactly (Section 15. ip . which is equivalent to the proximal 
minimization algorithm (llip . The convergence is presented both in terms of the function 
value and the norm of the residual. Next, since it is impractical to perform the inner mini- 
mization to high precision, the finite tolerance version of the two theorems are presented in 
Section 15.21 The convergence rate obtained in Section 15.21 is slightly worse than the exact 
case. In Section [5. 3^ we show that the convergence rate can be improved by performing the 
inner minimization more precisely. Most of the proofs are given in Appendix [C] for the sake 
of readability. 

Our result is inspired partly bv beck and Teboullel (I2OO9I ') and is similar to the one 
given in " 



not require asymp totic arguments as in 'Rockaf ellail ( 1976a ) or rely on the strong convexity of 



Rockafellai ( 1976al ) and Kort and Bertsekas ( 1976 ). However, our analysis does 



the objective as in lKort and Bertsekasi (1976 ). Importantly the stopping criterion we discuss 
in Section [5.21 can be checked in practice. Key to our analysis is the Lipschitz continuity of 
the gradient of the loss function V/^ and the assumption that the proximation with respect 
to (/>A (see Eq. (I15p ) can be computed exactly. Connections between our assumption and 
the ones made in earlier studies are discussed in Section [53 
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5.1 Exact inner minimization 

Lemma 1 (jBeck and Teboulld (j200^)) Let ^ w^, . . . be the sequence generated by the 



proximal minimization algorithm (Eq. (jlip ). For arbitrary w G M" we have 

ritif{w'+') - fiw)) < ^\\w' - - ^\\w'+' - w\\\ (38) 

Proof First notice that (tu* — w^^^)/r]t G df{w^'^^) because w*'^^ minimizes Eq. ([TT]) . 
Therefore using the convexity of /, we hav^ 

mifiw) - f{w'+')) >{w- w'+\w' - w'+^) (39) 

= (^W — W^^^ , — W + W — 10*^"^) 

> \\w - w^^^W'^ - \\w - II - lull 

> -\\w - - -||w* - w\\'^, 

where the third hue foUows from Cauchy-Schwartz inequahty and the last hue fohows from 
the inequahty of arithmetic and geometric means. ■ 

Note that DAL algorithm (Eqs. (|19p and (|20p) with exact inner minimization generates 
a sequence from the proximal minimization algorithm (Eq. (jlip ). Therefore we have the 
following theorem. 

Theorem 2 Let w^,w'^,... be the sequence generated by DAL algorithm (Eqs. ()19p and 
(j20p ); let W* be the set of minimizers of the objective ([1]) and let f{W*) denote the minimum 
objective value. If the inner minimization (Eq. (|2Up ) is solved exactly and the proximity 
parameter rjt is increased exponentially, then DAL algorithm converges exponentially as 
follows: 

||ii,0 _ W*||2 

fiw"^') - fiW*) < " ^J' " , (40) 

where \\w^ — W*\\ denotes the minimum distance between the initial solution and W* , 
namely, \\w^ — W*\\ = min^y.gvK* 11'"^'^ — ^ote that = Yld=o'^i '^^^'^ grows exponen- 
tially. 

Proof Let w* be any point in W*. Substituting w = w* in Eq. (j38p and summing both 
sides from t = 1 to t = k, we have 

7?f/(w*+^) _ ^ 1 ii^^^o _ ^*||2 _ _ ^y*||2 



(Etc..) E^P^-^<"'*'H5«"'° 



Et=oVt / 2 2 

1, 



^ - II * ||2 

< -\\w — W \\ . 



6. We use the notation {x, y) := X]?=i ^iVj foi" ^jV & 
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In addition, since f{w^^^) < f{w^) (t = 0, 1, 2, . . .) from Eq. pT]) . we have 

{EloVt) {f{w''+')-f{w*))<\\\w^-w*f. 

Finally, taking the minimum of the right-hand side with respect to w* £ W* and using the 
equivalence of proximal minimization (jlip and DAL algorithm (jl9p - (|20p (see Section [3.3p . 
we complete the proof. ■ 



The above theorem claims the convergence of the residual function values /(lu*) — f{w*) 
obtained along the sequence xi,X2, .... We can convert the above result into convergence 
in terms of the residual norm — by introducing an assumption that connects the 
residual function value to the residual norm. In addition, we slightly generalize Lemma [T] 
to improve the convergence rate. Consequently, we obtain the following theorem. 

Theorem 3 Let w^,w'^,... be the sequence generated by DAL algorithm (Eqs. (jl9p and 
(I20p ) and let W* be the set of minimizers of the objective ([T]). Let us assume that there are 
a positive constant a and a scalar a (1 < a < 2) such that 

(Al) f{w'+') - f{W*) > a\\w'+' - W*r it = 0, 1, 2, . . .), (41) 

where f{W*) denotes the minimum objective value, and \\w — W*\\ denotes the minimum 
distance between w G M" and the set of minimizers W* as \\w—W*\\ := mm^t^yy, \\w—w*\\. 
If the inner minimization is solved exactly, we have the following inequality: 

-W*\\+ crr]t\\w*+'^ - W*\\"-'^ < -W*\\. 

Moreover, this implies that 

1 + {q: — l)(TTjf 1 

\\w^+^-W*\\ 1+-"* < \\w^-W*\\. (42) 

That is, to* converges to W* super-linearly if a < 2 or a = 2 and rjt is increasing, in a 
global and non-asymptotic sense. 

Proof See Appendix I C.li ■ 

Note that the above super-linear convergence holds without the assumption in Theorem [2] 
that rjt is increased exponentially. 

5.2 Approximate inner minimization 

First we present a finite tolerance version of Lemma [TJ 

Lemma 4 Let w^,w'^, . . . be the sequence generated by DAL algorithm (Eqs. ()19p and (j20p ). 
Let us assume the following conditions. 

(A2) The loss function fi has a Lipschitz continuous gradient with modulus I/7, i.e., 

||V/^(z)-V/,(^')|| < ^ll^-^'ll (V^,z'gM™), (43) 
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(A3) The proximation with respect to (see Eq. (jlSp ) can he computed exactly. 
(A4) The inner minimization (Eq. (j20p ) is solved to the following tolerance: 

\\V^t{cx')\\<^W+^-w\ (44) 



where 7 is the constant in Eq. ()43p . 
Under assumptions (A2)-(A4), for arbitrary w G R" we have 

Vtifiw'^') - f{w)) < ^\\w' - wf - ^\\w'+' - wf. (45) 

Proof See Appendix [Ol ■ 

Note that Lemma Estates that even with the weaker stopping criterion (A4), we can obtain 
inequahty (I45p as in Lemma [T] 

The assumptions we make here are rather weak. In Assumption (A2), the loss function 
fe does not include the design matrix A (see Table [3]). Therefore, it is easy to compute 
the constant 7. Accordingly, the stopping criterion (A4) can be checked without assuming 
anything about the data. 

Furthermore, summing both sides of inequality (j45p and assuming that r]t is increased 
exponentially, we obtain Theorem [2] also under the approximate minimization (A4). 

Finally, an analogue of Theorem [3l which does not assume the exponential increase in 
rjt, is obtained as follows. 

Theorem 5 Let w^,w^,... be the sequence generated by DAL algorithm and let W* be 
the set of minimizers of the objective Under assumption (Al) in Theorem and 

(A2)-(A4) in Lemma\^ we have 

_ w*f + 2crr/t||w*+i - W*\\°' < \\w^ - W*f, (46) 

where \\w^ — W*\\ is the minimum distance between andW* as in Theorem\^ Moreover, 
this implies that 

l-\-a.ar)i ]^ 

That is, converges to W* super-linearly if a < 2 or a = 2 and r]t is increasing. 

Proof Let be the closest point in W* from w^, i.e., := argmin^»gy(/» \\w^ — w*\\. 
Using Lemma m with w = and Assumption (Al), we have the first part of the theorem 
as follows: 

\\w^ - W*f = - w*f > - w^f + 2ar]t\\w^+^ - W*^ 

> ||k;*+^ - W*f + 2ar]t\\w^+^ - W*]]", 

where we used the minimality of — iLj*^^!! in the second line. The last part of the 

theorem (I47p can be obtained in a similar manner as that of Theorem [3] using Young's 
inequality (see Appendix IC.ip . ■ 
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5.3 A faster rate 

The factor 1 j yj\ + 2ar]i obtained under the approxhnate minimization ( A4) (see inequal- 
ity ()47p in Theorem [5|) is larger than that obtained under the exact inner minimization 
(see inequality ()42p in Theorem [3]); i.e., the statement in Theorem [5] is weaker than that in 
Theorem [3l 

Here we show that a better rate can also be obtained for approximate minimization if 
we perform the inner minimization to 0(||i(;*^^ — w^\jT]t) instead of 0(||i(;*^^ — 
in Assumption (A4). 



Theorem 6 Let w^,w'^, . . . be the sequence generated by DAL algorithm and let W* be the 
set of minimizers of the objective ([T|). Under assumption (Al) in Theorem with a = 2, 
and assumptions (A2) and (A3) in Lemma^ for any e < 1 such that 5 := {1 — e)/{arit) < 
3/4, if we solve the inner minimization to the following precision 

(A4') \\VMc^')\\ < _ ^*||, 

Vt 

then we have 

_ W*\\ < \\W* - W*\\. 

1 + earjt 



Proof See Appendix IC. 31 ■ 

Note that the assumption 6 < 3/4 is rather weak, because if the factor 5 is greater than 
one, the stopping criterion (A4') would be weaker than the earlier criterion (A4). In order 
to be on the safe side, we can choose e = max(eo, 1 — 3arjt/4:) (assuming that we know 
the constant a) and the above statement holds with e = cq. Unfortunately, in exchange 
for obtaining a faster rate, the stopping criterion (A4') now depends not only on 7, which 
can be computed, but also on a, which is hard to know in practice. Therefore stopping 
condition (A4') is not practical. 



5.4 Validity of assumption (Al) 

In this subsection, we discuss t he val idity of assumption (Al) and i ts relation to the as- 
sumptions used in iRockafellail (Il976al ) and lKort and BertsekasI (Il97m Roughly speaking, 



Rockafellai (|l976al l and stronger than 



our assumption (Al) is milder than the o ne used in 
the one used in iKort and BertsekasI (|l976l ). 

First of all, assumption (Al) is unnecessary for convergence in terms of function value 
(Theorem [2] and its approximate version implied by Lemma H]) . Exponential increase of 
the proximity parameter r/t may sound restrictive, but this is the setting we typically use 
in experiments (see Section l7.ip . Assumption (Al) is only necessary in Theorem [3] and 
Theorem [5] to translate the residual function value f{w^^^) — f{W*) into the residual 
distance — W*\\. 

We can roughly think of Assumption (Al) as a local strong convexity assumption. Here 
we say a function / is locally strongly convex around the set of minimizers W* if for all 
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positive C, all the points w within distance C from the set W* , the objective function / is 
bounded below by a quadratic function, i.e., 

f{w)-f{W*)>cj\\w-W*f (iw -.Ww -W*\\ <C), (48) 

where the positive constant a may depend on C. If th e set of minimizers W* is bounded, all 
the level sets of / are bounded (see iRockafellar (|l97d . Theorem 8.7)). Therefore, if we make 



sure that the function value f{w^) does not increase during the minimization, we can assume 
that all points generated by DAL algorithm are contained in some neighborhood around W* 
that contains the level set defined by the initial function value {w G M" : f{w) < /(w")}; 
i.e., the local strong convexity of / guar antees Assumption (Al) with a = 2. 

Note that Kort and Bertsekas (|l976l . p278) used a slightly weaker assumption than the 



local strong convexity (I48p : they assumed that there exists a positive constant C" > such 
that the local strong convexity (|48p is true for all w in the neighborhood \\w — W*\\ < C 
for some a > 0. 

The local strong convexity (j48p or Assumption (Al) fails when the objective function 
behaves like a constant function around the set of minimizers W*. In this case, DAL con- 
verges rapidly in terms of function value due to Theorem [21 however it does not necessarily 
converge in terms of the distance \\w^ — W*\\. 

Note that the objective function / is the sum of the loss term and the regularization 
term. Even if the minimum eigenvalue of the Hessian of the loss term is very close to zero, 
we can hope that the regularization term holds the function up from the minimum objective 
value f{W*). For example, when the loss term is zero and we only have the £i-regularization 
term (l)^^{w). The objective f{w) = \ Yl^=i l^il be lower-bounded as 

f{w)>h\wf {yw:\\w\\<C), 

where the minimizer w* is w* = 0. Note that the £i-regularizer is not (globally) strongly 
convex. The same observation holds also for other regularize rs we discus s ed in S ection HI 
In the context of asymptotic analysis of AL algorithm, IRockafellar (Il976bl ^ assumed 



that there exists r > 0, such that in the ball ||/3|| < r in M", the gradient of the convex 
conjugate /* of the objective function / is Lipschitz continuous with constant L, i.e., 

l|Vr(/3)-Vr(0)||<L||/3||. 



Note that because df{Vf*{0)) B (jRockafellail . Il97d . Cor. 23.5.1), V/*(0) is the optimal 



solution w* of Eq. ([T|), and it is unique by the continuity assumed above. 

Our assumption (Al) can be justified from Rockafellar's assumption as follows. 

Theorem 7 Rockafellar's assumption implies that the objective f is locally strongly convex 
with C = ctL and a = min(l, (2c — l)/c^)/(2L) for any positive constant c (t and L are 
constants from Rockafellar's assumption). 



Proo f The proof is a Zoca/ version of the proof of Theorem X. 4. 2. 2 in Hiriart-Urruty and Lemarechal 



( I993I ) (Lipschitz continuity of V/* implies strong convexity of /). See Appendix IC. 41 



Note that as the constant c that bounds the distance to the set of minimizers W* increases. 
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the constant a becomes smaller and the convergence guarantee in Theorem [3] and [5] become 
weaker (but still valid). 

Nevertheless Assumption (Al) we use in Theorem [3] and [5] are weaker than the local 
strong convexity (08]), because we need Assumption (Al) to hold only on the points gen- 
erated by DAL algorithm. For example, if we only consider a finite number of steps, such 
a constant a always exists^ 

Both assumptions in iRockafellai] (|l976bl l and iKort and BertsekasI (|l976l l are made for 
asymptotic analysis. In fact, they require that as the optimization proceeds, the so- 
lution becomes closer to the optimu m w* in the sense of the distance \\w^ — W*\\ in 
Kort and BertsekasI (Ii97(tI 1 and in iRockafellari (Il976bl ). However in both cases, it is 



hard to predict how many iterations it takes for the solution to be sufficiently close to the 
optimum so that the super-linear convergence happens. 

Our analysis is complementary to the above classical results. We have shown that super- 
linear convergence happens non- asymptotically under Assumption (Al), which is trivial for 
a finite number of steps, for DAL algorithm. Moreover, Assumption (Al) can be justified 
for any number of steps by the local strong convexity around W* 



6. Previous studies 

In this section, we discuss earlier studies in two categories. The first category comprises 
methods that try to overcome the difficulty posed by the non-differentiability of the regu- 
larization term (f)\{w). The second category, which includes DAL algorithm in this paper, 
consists of methods that try to overcome the difficulty posed by the coupling (or non- 
separability) introduced by the design matrix A. The advantages and disadvantages of all 
the methods are summarized in Tabled! 



6.1 Constrained optimization, upper-bound minimization, and subgradient 
methods 

Many authors have focused on the non- differentiability of the regularization term in order 
to efficiently minimize Eq. ([1]). This view has lead to three types of approaches, namely, (i) 
constrained optimization, (ii) upper-bound minimization, and (iii) subgradient methods. 

In the constrained optimization approach, auxiliary variables are introduced to rewrite 
the non-differentiable regularization term as a linear function of conically-constrained aux- 
iliary variables. For example, the ^i-norm of a vector w can be rewritten as: 

n 

||iu||i = min ( ^j"^^ + '"^j I Wj=w^^^—w^j \ 



where 'U^j^'' and w)j ' {j = 1 , . . . , n) are auxiliary variables and they are constrained in 
the positive-orthant cone. Two major challenges of the auxiliary-variable formulation are 
the increased size of the problem and the complexity of solving a constrained optimization 

problem. 

The projected gradient (PG) method (see Bertsekas ( 19991 )) iteratively computes a gra- 



,(-) 



dient step and projects it back to the constraint-set. The PG method in iFigueiredo et al 
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converges R-linearlj0, if the loss function is quadratic. However, PG methods can 
be extremely slow when the design matr ix A is poor l y con ditioned. To overcome the scal- 



ing problem, the L-BFGS-B algorithm (iByrd et al.l . Il995l ) can be applied for the simple 



positive orthant constraint that arises from the ii minimization. However, this approach 
does not easily extend to more general regularizers, such as group lasso and trace norm 
regularization. 



The interior-point (IP) method (see iBovd and Vandenberghd (120041 ) ) is a n other algo- 
rithm that is often used for constrained minimization; see iKoh et al.1 (|2007l ): iKim et al 



(|2007l ) 'or the application of IP methods to sparse estimation problems. Basically an IP 
method generates a sequence that approximately follows the so called central path, which 
parametrically connects the analytic center of the constraint-set and the optimal solution. 
Although IP methods can tolerate poorly conditioned design matrices well, it is challeng- 
ing to scale them up to very large dense problems. The convergence of the IP method in 
Koh et al. (j2007l ) is empirically found to be linear. 

The second approach (upper-bound minimization) constructs a differentiable upper- 
bound of the non-differentiable regularization term. For example, the ^i-norm of a vector 
w can be rewritten as follows: 



w 



E 



mm — ^ 

Oj>o \ 2a j 



+ 



(49) 



In fact, the right-hand side is an upper bound of the left-hand side for arbitrary non- 
negative aj due to the inequality of arithmetic and geometric means, and the equality is 
obtained by setting aj = \wj\. The advantage of the above parametric-upper-bound for- 
mulation is that for a fixed set of Oj, the problem ([2]) becomes a (weighted) quadratically 
regularized minimization problem, for which va rious efficient algorithins already exist. The 
iteratively reweighted s hrinka ge (IRS) method ( Gorodnitsky and Raol . 19971 : Bioucas-Dias . 
20061 : iFigueiredo et al.1 . l2007al ) alternately solves the quadratically regularized minimization 
problem and tightens (re- weights) the upper-bound in Eg . (1491). A r nore general technique 



was studied i i i para llel by the name of variational EM ([Jaakkolal . 119971 : iGirolamil . 12001 



Palmer et al 



.20061). which generalizes the above upper-bound using Fench el's inequal- 

itv (lRockafellail . ll97d '). A similar approach that is based on Jens en's inequalitv (iRockafellan . 
19701') has been studied i n the context of multiple-kernel learning (jMicchelli and Pontii 12005 : 
Rakotomamoniv et al. . 20081 ) and in the context of multi-task l earning (Argvriou et al 



20071 . I2OO8I ) . The challenge in the IRS framework is the singularity (jFigueiredo et al.l . l2007al ) 
around the coordinate axis. For example, in the £i-problem in Eq. ([2]), any zero component 
Wj = in the initial vector w will remain zero after any number of iterations. Moreover, 
it is possible to create a situation that the convergence b ecomes arbitrarily slow for f inite 



because the convergence in the ii case is only linear (jGorodnitskv and Raol . 119971 ). 



The third approach (subg radient methods ) directly handles the non-differentiability 
through subgradients; see e.g.. iBertsekas (I1999I ). 

A (stochastic) subgradient method typically converges as 0{l/Vk) for non-smooth prob- 



lems in general and as 0(1/ k) if the objective is strongly convex; see IShalev-Shwartz et al 



7. A sequence ^' converges to ^ R-lin early (R is for "root" ) if th e residual 
e* tfiat linearly converges to zero l|Nocedal and Wriglitl[l999l ). 



- ^1 is bounded by a sequence 
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Table 1: Comparison of the algorithms to solve Eq. ([T|). In the columns, six methods, 
namely, projected gradient (PG), interior point (IP), iterative reweighted shrinkage 
(IRS), orthant-wise limited- memory quasi Newton (OWLQN), accelerated gradi- 
ent (AG), and dual augmented Lagrangian (DAL), are categorized into four groups 
discussed in the text. The first row: "Poorly conditioned A" means that a method 
can tolerate poorly conditioned design matrices well. The second row: "No singu- 
larity" means that a method does not suffer from singularity in the parametrization 
(see main text). The third row: "Extensibility" means that a method can be easily 
extended beyond £i-regularization. The forth row: "Exploits sparsity of w" means 
that a method can exploit the sparsity in the intermediate solution. The fifth row: 
"Efficient when" indicates the situations each algorithm runs efficiently, namely, 
more samples than unknowns (m ^ n), more unknowns than samples (m <^ n), 
or does not matter (-). The last row shows the rate of convergence known from 
literature. The super-linear convergence of DAL is established in this paper. 





Constrained 


Upper-bound 


Subgradient 
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Optimization 


Minimization 
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/ 




/ 


/ 


/ 


Extensibility 


/ 


/ 


/ 
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/ 
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/ 
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n 


n 


m <^n 


Convergence 


(0(e-*)) 


(0(e-'=)) 




? 


0(1/P) 





However, s nee the method is based on gr adients, it can easily fail 



when the problem is poorly conditioned (see e.g., (jYu et al.l . l20ld . Section 2.2)). Therefore 



one of the challenges in subgradient-based approaches is to take the second-order curvature 
information into account. This is especially important to tackle large-scale problems with 
a possibly poorly conditioned desi gn matrix. Ortha nt-wise limited memory quasi Newton 
(OWLQN. [Andrew and Gaol jioO^)) and subLBFGS (IYu et al.l.l20ld^ combine s ubgradients 
with the well known L-BFGS quasi Newton method ( Nocedal and Wright . 19991 ). Although 
being very efficient for £i-regularization and piecewise linear loss functions, these methods 
depend on the efficiency of oracles that compute a descent direction and a step-size; there- 
fore, it is challenging to extend these methods to combinations of general loss functions and 
general non-differentiable regularizers. In addition, the convergence rates of the OWLQN 
and subLBFGS methods are not known. 



6.2 Iterative proximation 

Yet another approach is to deal with the nondifferentiable regularization through the prox- 
imity operation. In fact, the proximity operator (I15p is easy to compute for many practically 
relevant separable regularizers. 
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The remaining issue, therefore, is the couphng between variables introduced by the de- 
sign matrix A. We have shown in Sections 13. 21 and 13.31 that 1ST and DAL can be considered 
as two different strategies to remove this couphng. 

Recently m any studies have focused on methods that itera,tively compute the proximal 
operation (fTSl) (iFigueiredo and Nowakl.l2003l:lDaubechies et al.U2004l : ICombettes and Waisl . 
2nn,'^ : iNesteravl 120071 : iBeck and Teboullel . boOfll : ICai et all . 1 20081 ;) . which can be described in 
an abstract manner as follows: 



= prox^ (y*) , 



(50) 



where prox^^^ is the proximal operator defined in Eq. (jl5p . The above mentioned studies 
can be differentiated by the different and At that they use. 



1979: 


Combettes and Wais. 


20051; 


Duchi and Singer. 



■.= w' -7]tA^Vfi{Aw'), 
Xt := \r]t. 

What we need to do at every iteration is only to compute the gradient at the current point, 
take a gradient step, and then perform the proximal operation (Eq. ()50p ). Note that r]t can 
be considered step-size. 

The 1ST method can be considered as a generalization of the projected gradient method. 
Since the proximal gradient step ()13p reduces to an ordinary gradient step when = 0, 
the basic idea behind 1ST is to keep the non-smooth term (p\ as a part of the proximity 
step (see|La3 toid )). Consequently, the convergence behaviour of 1ST i s the same as that 
of (pr ojected) gradient descent on the differentiable loss term. Note that iDuchi and Singer 
(120091 ) analyze the case where the loss term is also non-differentiable in both batch and 
online learning settings. Langford et al. ( 20091 ) also analyze the online setting with a more 
general threshold operation. 

1ST approach maintains sparsity of throughout the optimization, which results in 
significant reduction of computational cost; this is an advanta g e of i terative proximation 
methods compared to interior-point methods (e.g., iKoh et al.l (j2007l )). because the solu- 



tion produced by interio r -point methods becomes sparse only in an asymptotic sense; see 
Bovd and Vandenberghe ( 20041 ) . 

The downside of the 1ST approach is the difficulty to choose the step-size parameter r]t; 
this issue is especially problematic when the design matrix A is poorly c onditioned. In addi- 
tion, the best known convergence rate of a naive 1ST approach is 0(1/A;) (jBeck and Teboulld . 
20091 ). which means that the number of iterations k that we need to obtain a solution w 
such that f{w^) — f{w*) < e grows linearly with 1/e, where f{w*) is the minimal value of 
Eq. (P). 

SpaRSA dWright et al.l . l2009l ) uses approximate second order curvature inforrnation for 
the selection of the step-size parameter rjt- TwIST ( Bioucas-Dias and Figueiredo . 20071 ) is 
a "two-step" approach that tries to alleviate the poor efficiency of 1ST when the design 
matrix is poorly conditioned. However the convergence rates of SpaRSA and TwIST are 
unknown. 
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Ac celer ating strategies that use different choices of have been proposed in iNesterov 
(|2007l ) and lBeck and TebouUel (|2009l l (denoted AG in Tab.H]), which h ave 0( 1 /fc^) guarantee 
with almost the same computational cost per iteration; see alsolL^ \2Qld ). 



DAL can be considered as a new member of the family of iterative proximation al- 
gorithms. We have qualitatively shown in Section 13.31 that DAL constructs a better lower 
bound of the loss term than 1ST. Moreover, we have rigorously studied the convergence rate 
of DAL and have shown that it converges super-linearly. Of course the fast convergence of 
DAL comes with the increased cost per iteration. Nevertheless, as we have qualitatively 
discussed in Section HI this increase is mild, because the sparsity of intermediate solutions 
can be effectively exploited in the inner minimization. We empirically compare DAL with 
other methods in Section [71 

There is of course an issue on how much one should precisely optimize when the train- 
ing error (plus the regularizatiq r i term ) is a crude approximation of the generalization er- 
ror ( Shalev-Shwartz and Srebrol . 20081 ) . However the reason we use sparse regularization is 
exactly that we are not only interested in the predictive power. We argue that when we are 
using sparse methods to gain insights into some problem, it is important that we are sure 
that we are doing what we write in our paper (e.g., "solve an ^i-regularized minimization 
problem"), and someone else can reliably recover the same sparsity pattern using any opti- 
mization approach that employs some objective stopping criterion such as th e duality gap. 
Of co u rse the stability of the optinia,l solu tion itself must be analyzed (see I Zhao and Yu 
(|2006l ): iMeinshausen and Biihlmannl Jiooi)) and the trade-off between accuracy and spar- 
sity should be discussed. However, this is beyond the scope of this paper. 



7. Empirical results 



In this section, we confirm the super-linear convergence of DAL algorithm and compare 
it with other algorithms on £i -regularized logistic re gression pr oblems. The algorithra s 



that we c ompare are FISTA (IBeck and Teboulld. 12 009). OWLQN (lAndrew and Cxaol. l2007l) 



SpaR SA (jWright et ai] . l2009l ). IRS (jFigueiredo etal.. . ,2007 a). and L1_L0GREG (|Koh et al. 



20071 ) . Note that 1ST is not included because SpaRSA and FISTA are shown to clearly out- 
perform the naive 1ST approach. We describe the logistic regression problem and the im- 
plementation of all of the methods in Section 17.11 The synthetic experiments are presented 
in Section 17.21 and the benchmark experiments are presented in Section 17.31 



7.1 Implementation 

In this subsection, we first describe the problem to be solved and then explain the imple- 
mentation of the above mentioned algorithms in detail. 

For all algorithms except for IRS, the initial solution was set to an all zero vector. For 
IRS, the initial solution was sampled from an independent standard Gaussian distribution. 

The CPU time was measured on a Linux server with two 3.1 GHz Opteron Processors 
and 32GB of RAM. 
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7.1.1 ^i-REGULARIZED LOGISTIC REGRESSION 

The logistic regression model is defined by the loss function 



/LR(^) = X^log(l + e-^'^0, (51) 



i=l 



where j/j G {— Ij +1} is a training label. The conjugate of the loss function can be obtained 
as follows: 

m 

/lr(-") = X] ("'^^ log(Qiyi) + (1 - aiVi) log(l - aiyi)) . 

i=l 

Rewriting the dual problem ([3]) we have the following expression: 

maximize — fiTfi—a), (52) 

subject to ||A^Q;||oo < A, (53) 

where ||y||oo = '^^'^i=\,...,n \yj\ is the ^oo-norm; note that the implicit constraint in Eq. ([3]) 
(through the indicator function (5^) is made explicit in Eq. (|53|) . 

For the experiments in this section, we reparametrize the regularization constant A as 
A = A|| A^t/lloo- The reason for this reparametrization is that for all A > 0.5 the solution w 
can be shown to be zero; thus we can measure the strength of the regularization relative to 
the problem using A instead of A. This is because the conjugate loss function takes the 
minimum at = ?/j/2 and the minimum is attained for A > || A~'^(y/2)||oo (see Eq. ([53]) ). 

7.1.2 Duality gap 

We used the relative duality gap (RDG) as a stopping criterion with tolerance 10~^. More 
specifically, we terminated all the algorithms described below when RDG fell below 10~^. 
RDG was computed as follows for all algorithms except L1_L0GREG. For L1_L0GREG, 
we modified the stopping criterion implemented in the original cod e by the autho r s from 
absolute duality gap to relative duality gap. See also lKoh et ^ hm'h ; IWright et al] jioO^) ; 
Tomioka and Sugiyama (|2009l l. 



Let a* be any candidate dual vector at ith iteration. For example, o;* = ct* for DAL 
and a* = -\l f (,{Aw^^^) for OWLQN, SpaRSA, and IRS. Note that the above a* does not 
necessarily satisfy the dual constraint (f53]l . Thus we define a* = q* min(l, A/|| A'''ci;*||oo)- 
Notice that ||A^q;*||oo < A by construction. We compute the dual objective value as 
d(i(;*+i) = -/;(-a*); see Eq. ([52]). Finally RDG*+^ is obtained as RDG*+i = (/(iu*+i) - 
(i(i(;*+^))//(i(;*+^), where / is the primal objective function defined in Eq. ([TJ. 

The norm of the minimum norm subgradient is also frequently used as a stopping cri- 
terion. However, there are two reasons fo r using RDG instead. Fir st, the gradient at the 
current point is not evaluated in FISTA ([Beck and Teboullel . and it requires addi- 



tional computation, whereas the vector a* in the computation of RDG does not need to be 
the gradient at the current point; in fact the gradient at any point (or any m-dimensional 
vector) gives a valid lower bound of the minimum objective value. Second, since the gra- 
dient can change discontinuously at non differentiable points, the norm of gradient does 
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not reflect the distance from the solution well; this is a problem for e.g., an interior-point 
method, because it produces a sparse solution only asymptotically. 

7.1.3 DAL 

DAL algorithm is implemented in MATLAE@. The inner minimization problem (see Eq. (jlOp ) 
is solved with Newton's method; we used the preconditioned conjugate gradient (PCG) 
method for solving the associated Newton system (peg function in MATLAB); we use the 
diagonal elements of the Hessian matrix (see Eq. (|25p ) as the preconditioner. The inner 
minimization is terminated by the criterion (j44p with 7 = 4, because the Hessian of the loss 
function (jSTj) is uniformly bounded by 1/4 (see Table [3|) . 

We chose the initial proximity parameter to be either ryo = 0.01 /A (conservative setting) 
or 7/0 = 1/A (aggressive setting) and increased T]t by a factor of 2 at every iteration. Since r]t 
appears in the soft-thresholding operation multiplied by A, it seems to be intuitive to choose 
r]t inversely proportional to A but we do not have a formal argument yet. We empirically 
discuss the choice of rjQ in more detail in Section 17.2.41 

The algorithm was terminated when the RDG fell below 10~^. 



7.1.4 DAL-B 

DAL-B is a variant of DAL with an unregularized bias term (see upda te equation s pSD - 



(|37l) ). This algorithm is included because L1_L0GREG implemented by iKoh et al.l (120071 ) 
estimates a bias term and therefore cannot be directly compared to DAL. 

As an augmented Lagrangian algorithm, DAL-B solves the following dual problem: 

maximize - flR{-a) - Sf (v), (54) 

subject to A^a. = v, (55) 
l^ct = 0. (56) 

See also Eqs. ([3]) and (g]). 

When implementing DAL-B, we noticed that sometimes the algorithm gets stuck in a 
plateau where the additional equality constraint (j56p improves very little. This was more 
likely to happen when the condition of the design matrix was poor. 

In order to avoid this undesirable slow-down, we heuristically adapt the proximity pa- 
rameter ijt for the equality constraint (j56p . Note that this kind of modification cannot 
improve the theoretical convergence result without additional prior information. More 
specifically, we use proximity parameters r][^^ and rj^'^^ for equality constraints (j55p and 
()56p . respectively. The AL function (I37p is rewritten as follows 

a* = argmin ( //(-«) + ^--||prox^;^ (w* + ritA^a)\f + ^7^(6* + Vt^rrJ otf ] ■ 

First we initialize rl^^ = rj^^^ = 0.01/A (conservative setting) or rj^^^ = t]^^ = 1/A (aggressive 
setting) as above. The proximity parameter r/j^^ with respect to Eq. (j55]) is increased by the 



8. The software is available from http : / /www . ibis . t . u-tokyo . ac . jp/ryotat/dal/. 
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(2') 

factor 2 at every iteration (the same as DAL). The proximity parameter r/^ with respect 



to Eq. (I56p is increased by a larger factor 40 if the fohowing conditions are satisfied: 

1. The iteration counter t>\. 

2. The violation of the equality constraint ()56p . namely viol* := |1'''q*|, does not suffi- 
ciently decrease; i.e., viol* > viol*~^/2. 

3. The violation viol* is larger than 10^'^ (the tolerance of optimization). 

Otherwise, vif'^ is increased by the same factor 2 as y]^'^ . 

Note that the theoretical results in Section [5] still holds if we replace rjt in Section [5] by 
'rlj^\ because 77}^^ < 77^^; i.e., the stopping criterion (j44p and the convergence rates simply 
become more conservative. 

Since DAL-B has an additional equality constraint (j56p . We modified the computation 
of relative duality gap described above by defining the candidate dual vector a* as a* = 
n-* 3-1 1 T t 

m ™ ■ 

7.1.5 Fast Iterative Shrinkage- Thresholding Algorithm (FISTA) 

FISTA algorithm (jSeck and Teboulld . I2OO9I ) is implemented in MATLAB. The algorithm 



is terminated by the same RDG criterion except that the dual objective is evaluated at y* 
in update equation (jSOp instead of w*^^\ this approach saves unnecessary computation of 
gradients. 

7.1.6 Orthant-wise limited memory quasi Newton (OWLQN) 

OWLQN algorithm dAndrew and Gaol . boOTj ) is also implemented in MATLAB because 



we found that our MATLAB implementation was faster than the C++ implementation 
provided by the authors; this is because MATLAB uses optimized linear algebra routines 
while authors' implementation does not. The algorithm is terminated by the same RDG 
criterion as DAL. 

7.1.7 Sparse Reconstruction by Separable Approximation (SpaRSA) 

SpaRSA algorithm dWright et alLbonfll ^ is implemented in MATLAB. We modified the code 



provided by the authorqj to handle the logistic loss function. The algorithm is terminated 
by the same RDG criterion. 

7.1.8 Iterative reweighted shrinkage (IRS) 

IRS algorithm is implemented in MATLAB. At every iteration IRS solves a ridge-regularized 
logistic regression problem with the regularizer defined in Eq. (I49p . This problem can be 
converted into a standard ^2-i'egularized logistic regression with the design matrix A = 
Adiag(y^ai, . . . , ^/o^^) by reparametrizing Wj to Wj = Wjj .^Jaj. The weight aj is set to |tt;*| 
before solving the problem. Thus if any w*- = 0, the corresponding column of A becomes 
zero and it can be rer noved from the optimizatio n. We use the limited memory BFGS 
quasi-Newton method ( Nocedal and Wright . 19991 ) to solve each sub-problem. 



9. http : //www . Ix . it .pt/~mtf /SpaRSA/ 
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7.1.9 Interior point algorithm (L1_L0GREG) 

L1_L0GREG algorithm jKoh et al.1 . [ioOTl l is implemented in C. We modified the code 
provided by the authorj^^l as a C-MEX function so that it can be called directly from 
MATLAB without saving matrices into files. We used the BLAS and LAPACK libraries 
provided together with MATLAB R2008b (-Imwblas and -Imwlapack options for the mex 
command). L1_L0GREG is also terminated by the RDG criterion. 

Note that L1_L0GREG also estimates an unregularized bias term. DAL algorithm with 
a bias term (DAL-B) is included to make the comparison easy; see Section 17.1.41 



7.2 Synthetic experiment 

In this subsection, we first confirm the convergence behaviour of DAL (Section l7.2.2p : second 
we compare the scaling of various algorithms against the size of the problem (Section l7.2.3p : 
finally we discuss how to choose the initial proximity parameter rjQ (Section 17. 2. 4p . 



7.2.1 Experimental setting 

The elements of the design matrix A G j^^x" were randomly sampled from an independent 
standard Gaussian distribution. The true classifier coefficient /3 was generated by filling 
randomly chosen element (4%) of a n-dimensional vector with samples from an independent 
standard Gaussian distribution; the remaining elements of the vector were set to zero. The 
training label vector y was obtained by taking the sign of A/3 + 0.01^, where ^ G was 
a sample from an m-dimensional independent standard Gaussian distribution. The whole 
procedure was repeated ten times. 



7.2.2 Empirical validation of super-linear convergence 

In this section, we empirically confirm the validity of the convergence results (Theorems [21 [3] 
and[5]) obtained in the previous section and compare the efficiency of DAL, FISTA, OWLQN, 
SpaRSA, and IRS for the number of samples m = 1, 024 and the number of parameters n = 
16, 384. L1_L0GREG is not included because it solves a different minimization problem. 
We use the regularization constant A = 0.01. For DAL, we used the aggressive setting 
(ryi = l/A,2/A,4/A,...). 

First in order to obtain the true minimizei0 w* of Eq. ([1]), we ran DAL algorithm 
to obtain a solution with high precision (RDG < 10~^). Assuming that the support of 
this solution is correct, we performed one Newton step of Eq. ([T]) in the subspace of active 
variables. The solution w* we obtained in this way satisfied \\Vf{w*)\\ < IQ-^^, where 
V/(io*) is the minimum norm subgradient of / at w* . The parameter a in Eq. ()4ip was 
estimated by taking the minimum of (/(lu*) — f{w*))/\\w^ — w*\\'^ along the trajectory 
obtained by the above minimization and multiplying the minimum value by a safety factor 
of 0.7. In order to estimate the residual norm — we use bounds ()42p and (I47p with 
a = 2 and the initial residual \\w^ — w*\\. The bound (j40p in Theorem [2] is used with the 
same initial residual to estimate the reduction in the function value. 



10. http : //www. Stanford. edu/"'boyd/ll_logr eg/ 

11. We assume that the minimizer is unique. 
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Figure 2: Empirical 



comparison 



of DAL, FISTA teeck and Teboulld . boo^ ). 



OWLQN dAndrew and Gaol . 120071 ) . and SpaRSA dWright et alJ . l2009l ). Top left: 
residual norm vs. number of iterations. Also the theoretical guarantees for DAL 
from Theorems [3] and [5] are shown. Top right: residual norm vs. CPU time. 
Bottom left: residual in the function value vs. number of iterations. Bottom 
right: residual in the function value vs. CPU time. 



In Figure O we show a result of a typical (single) run of the algorithms described above. 
Note that the result is not averaged to keep the meaning of theoretical bounds. 

In the top left panel of Figure O we can see that the convergence in terms of the norm of 
the residual vector w^—w* happens indeed rapidly as predicted by the theorems in Section[5j 
The yellow curve shows the result of Theorem [3l which assumes exact minimization of 
Eq. (|20p , and the magenta curve shows the result of Theorem [5l which allows some error in 
the minimization of Eq. (120p . We can see that the difference between the optimistic analysis 
of Theorem [3] and the realistic analysis of Theorem [5] is negligible. In this problem, in order 
to reach the quality of solution DAL achieves in 10 iterations OWLQN and SpaRSA take 
at least 100 iterations and FISTA takes 1,000 iterations. The IRS approach required about 
the same number of iterations as OWLQN and SpaRSA but each step was much heavier 
than those two algorithms (see also the top right panel in Figure [2]) and it was terminated 
after 100 iterations. 



28 



Dual Augmented-Lagrangian Converges Super-Linearly 



■ DAL-B (w/o heuristics) -| (f 

■ DAL-B (with heuristics) 
■L1 LOGREG 




10 20 30 
Number of iterations 



50 

CPU times (s) 



100 



Figure 3: Comparison of DAL-B and L1_L0GREG (|Koh et all . 120071 ) . Both algorithms 
estimate an unregularized bias term. The left panel shows the residual function 
value against the number of iterations. The right panel shows the same against 
the CPU time spent by the algorithms. 



The bottom left panel of Figure [2] shows comparison of five algorithms DAL, FISTA, 
OWLQN, SpaRSA, and IRS in terms of the decrease in the function value. Also plotted is 
the decrease in the function value predicted by Theorem [2] (magenta curve). The conver- 
gence of DAL is the fastest also in terms of function value. OWLQN and SpaRSA are the 
next after DAL and are faster than FISTA. 

DAL needs to solve a minimization problem at every iteration. Accordingly the opera- 
tion required in each iteration is heavier than those in FISTA, OWLQN, and SpaRSA. Thus 
we compare the total CPU time spent by the algorithms in the right part of Figure [2j It can 
be seen that DAL can obtain a solution that is much more accurate in less than 10 seconds 
than the solution FISTA obtained after almost 60 seconds. In terms of computation time, 
DAL and OWLQN seem to be on par at low precision. However as the precision becomes 
higher DAL becomes clearly faster than OWLQN. SpaRSA seems to be slightly slower than 
DAL and OWLQN. 

Two algorithms (DAL-B and L1_L0GREG) that also estimate an unregularized bias 
term are compared in Figure [3j The number of observations m = 1, 024 and the number of 
parameters n = 16, 384, and all other settings are identical as above. A variant of DAL-B 
that does not use the heuristics described in Section 17.1.41 is included for comparison. For 
DAL-B without the heuristics, the proximity parameters rj^^^ and 77^^ are both initialized to 
1 /A and increased by the factor 2. For DAL-B with the heuristics, the proximity parameter 
rjl is increased more aggressively; see Section [7.1.41 

In the left panel in Figure [3l the residual of primal objective values of both alg o rithni s 
are plotted against the number of iterations. As empirically observed in iKoh et al.i (|2007l ). 
L1_L0GREG converges linearly; after roughly 10 iterations, the residual function value 
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Figure 4: CPU time of various algorithms on synthetic logistic regression problems. 



reduc es by a factor around 2 in each iteration (a factor 1.85 was reported in iKoh et al 



(|2003)). The convergence of DAL-B is faster than L1_L0GREG and the curve is slightly 



concave downwards, which indicates the super-linearity of the convergence. Note also that 
the linear convergence bound from Theorem [2] is shown. The heuristics described in Sec- 
tion 17.1.41 shows almost no effect on this problem, probably because the design matrix is 
well conditioned. 

The right panel in Figure [3] shows the same information against the CPU time spent 
by the algorithms. DAL-B is roughly 10 times faster than L1_L0GREG to achieve residual 
less than 10~^. 

7.2.3 Scaling against the size of the problem 

Here we compare how well different algorithms scale against the number of parameters n. 
We fixed the number of samples m at m = 1, 024 and varied the number of parameters from 
n = 4, 096 to n = 524, 288. We used two regularization constants A = 0.1 and A = 0.01. 
The results are summarized in Figure HI Figures 4(a) and |4(b)| show the results for 



A = 0.01 and A = 0.1, respectively. In each figure we plot the CPU time spent to reach 
RDG< 10~^ against the number of parameters n. 

One can see that DAL has the mildest dependence on the number of parameters among 
the methods compared. In particular, DAL is faster than other algorithms for roughly 
n > 10"^. Also note that DAL and DAL-B show similar scaling against the number of 
parameters; i.e., adding an unregularized bias term has no significant influence on the 
computational efficiency. 

For A = 0.01, SpaRSA show s sharp increase in the CPU ti me from around n = 32,768, 
which is similar to the result in lTomioka and Sugiyama (|2009l l (Figure 3). Also notice the 



increased error-bar. In fact, for n > 65,536, it had to be stopped after 5,000 iterations in 
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some runs, whereas it converged after few hundred iterations in other runs. On the other 
hand, SpaRSA scales similarly to OWLQN and is more stable for A = 0.1. 

For all algorithms except L1_L0GREG, solving the problem for larger regularization 
constant A = 0.1 requires less computation than for A = 0.01. Nevertheless the advantage 
of the DAL algorithm is larger for the more computationally demanding situation of A = 0.01 
against FISTA, OWLQN, SpaRSA, and IRS. On the other hand, the advantage of DAL 
against L1_L0GREG is larger for A = 0.1, because the CPU time of L1_L0GREG is almost 
constant in both cases. The CPU time of DAL with (DAL-B) and without (DAL) the bias 
term are almost the same. 

7.2.4 Choice of rjo 

In this subsection, we show how the choice of the sequence 7]t changes the behaviour of DAL 
algorithm. We ran DAL algorithm for A = 0.1 with ijq = 1/A (as above), which we call the 
aggressive setting, and r]Q = 0.01/A, which we call the conservative setting. In both cases, 
rjt is increased by a factor of 2 as in the previous experiments. No bias term is used. 

In Figure O plotted are the number of PCG steps for inner minimization and the CPU 
time spent by DAL algorithm with the conservative setting (r/o = 0.01/A, left) and the 
aggressive setting (r/o = 1/A, right). The average number of PCG steps and CPU time 
are shown as stacked bar-plots, in which each segment of a bar corresponds to one outer 
iteration. One can see that in the conservative setting, DAL uses roughly 8 to 10 outer 
iterations, whereas in the aggressive setting, the number of outer iterations is reduced to 
less than a half (3 or 4). On the other hand, the total number of PCG steps is only 
slightly smaller in the aggressive setting. Therefore, in the aggressive setting DAL spends 
more PCG steps for each outer iteration. It is worth noting that almost half of the PCG 
iterations are spent for the first outer iteration in the aggressive setting, whereas in the 
conservative setting the PCG steps are more distributed. In terms of the CPU time, the 
aggressive setting is about 10-30% faster than the conservative setting because it saves 
both computation required for each outer-iteration and inner-iteration. However, generally 
speaking increasing the proximity parameter rjt makes the condition of the problem worse; 
in fact we found that the algorithm did not always converge for rjQ = 100/A. Thus it is not 
recommended to use too large value for ijt. 

Figure [6] compares the total CPU time spent by the two variants of DAL. As discussed 
above, the aggressive setting (ryo = 1/A) is faster than the conservative setting (ryo = 0.01/A). 
However the difference is minor compared to the change in the proximity parameter ryo, 

7.3 Benchmark datasets 

In this subsection, we apply the algorithms discussed in the previous subsection except IRS 
to benchmark datasets, and compare their efficiency on various problems; IRS is omitted 
because it was clearly outperformed by other methods on the synthetic data. 
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Figure 5: Comparison of behaviours of DAL algorithm for different choices of initial prox- 
imity parameter rjQ. Left: rjo = 0.01/A (conservative setting). Right: r/o = 1/A 
(aggressive setting). On the top row, the cumulative numbers of PCG steps (in- 
ner steps) are shown. On the bottom row, the cumulative CPU time spent by 
the algorithm is shown. 
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Figure 6: Comparison of conservative (ryo = 0.01/A) and aggressive {rjQ = 1/A) choice of 
proximity parameter rjQ. Note that the aggressive setting is used in Sections 17.2.21 
and 17.2.31 and the conservative setting is used in Section 17.31 
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7.3.1 Experimental setting 

The benchmark datasets we use are five datasets from NIPS 2003 Feature Selection Chal- 
lengj^ . 20 newsgroups datasell^, and a bioinformatics dat^lll provided by Baranzini et al 



too4 ) 



The five datasets from the Feature Selection Chahenge (arcene, dexter, dorothea, 
gisette, and madelon) are all split into training-, validation-, and test-set. We combine 
the training- and validation-sets and randomly split each dataset into a training-set that 
contains two-thirds of the examples, and a test-set that contains the remaining one-third. 
We apply the £i-regularized logistic regression solvers to the training-set and report the 
accuracy on the test-set as well as the CPU time for training. This procedure was repeated 
10 times (also for the two other datasets below). The numbers of training instances and 
features, and the format of each dataset (sparse or dense) are summarized in Table [2j 

From the 20 newsgroups dataset (20news), we deal with the binary classification of 
category "alt. atheism" vs. "comp.graphics" . We use the preprocessed MATLAB format 
data. The original dataset consists of 1,061 training examples and 707 test examples. We 
again combine all the examples and randomly split them into a training-set containing 
two-thirds of the examples and a test-set containing the rest. The training example has 
n = 61, 188 feat ures which are provided as a sparse matrix. 

The goal in baranzini et all » is to predict the response (good or poor) to re- 



combinant human interferon beta (rIFN/3) treatment of multiple sclerosis patients from 
gene-expression measurements. The dataset is denoted as gene. The dataset consists of 
gene-expression profile of 70 genes from 52 subjects. We again randomly select two-thirds of 
the subjects for training and the remaining for testing. Following the setting in the original 
paper, we used only the expression data from the beginning of the treatment (t = 0) and 
preprocessed the data by taking all the polynomials up to third order, i.e., we compute 
(i) X, x^, and for each single feature x, (ii) xy, x'^y, and xy^ for every pair of features 
{x,y), and (iii) xyz for every triplet of features {x,y,z). As a result we obtain 62,195 
(= 3 • 70 + 3 • 2,415 + 54,740) features. 

In every dataset, we standardized each feature to zero mean and unit standard deviation 
before applying the algorithms. Since the standardized design matrix A is usually dense 
even if the original matrix A is sparse, we provide function handles that compute Ax and 
A^y instead of A itself with DAL, FISTA, OWLQN, and SpaRSA. This can be done by 
keeping the vector of means and standard deviations of the original design matrix as follows: 

Ax = AS~^x — ImTnJ S^^x, 

A^y = S-'A'^{y--l^lrrJy), 
m 

where m G M" is the vector of means and is a n x n diagonal matrix that has the standard 
deviations of the original features on the diagonal. If the standard deviation of any feature 
is zero, we placed one in th e corresponding e lement of S. L1_L0GREG is implemented 
with a similar technique; see iKoh et al.l (|2007l l. 



12. The d atasets are available from http://www.nipsfsc.ecs.soton.ac.uk/datasets/; see iGuvon et al.l 
l|2006l ) for more information. 

13. The dataset is available from http://people.csail.mit.edu/jrennie/20Newsgroups/. 

14. The data is available from http : //www.plosbiology . org/article/inf o : doi/ 10 . 1371/journal .pbio . 0030002. 
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We compare the CPU time that is necessary to compute the whole regularization path. 
In order to define the regularization path, we choose 20 log-linearly separated values from 
A = 0.5 to A = 0.001, where A is the normalized regularization constant defined in Sec- 
tion [TTLTI We apply a warm start strategy to all the algorithms; i.e., we sequentially solve 
problems for smaller and smaller regularization constants using the solution obtained from 
the last optimization (for a larger regularization constant) as the initial solution. 

All the methods were terminated when the relative duality gap fell below 10~^. For 
DAL algorithms (DAL and DAL-B) we choose the conservative setting, i.e., we initialize 
rjl^^ and t/q^^ as 0.01/A. 

7.3.2 Results 

Table [2] summarizes the problems and the performance of the algorithms. For each algo- 
rithm, we show the maximum test accuracy obtained in the regularization path and the 
CPU time spent to compute the whole path. The smallest and the second smallest CPU 
times are shown in bold-face and italic, respectively. One can see that DAL is the fastest 
in most cases when the number of parameters n is larger than the number of observations. 
In addition, the CPU time of two variants of DAL (with and without the bias term) tend 
to be similar except dorothea dataset. For most datasets, the accuracy obtained by DAL 
algorithm is close to FISTA, OWLQN, and SpaRSA, and the accuracy obtained by DAL-B 
is close to L1_L0GREG. 

Figure [7] illustrates a typical situation where DAL algorithm is efficient. Since the size 
of the inner minimization problem (j20p is proportional to the number of observations m, 
when m, DAL is more efficient than other methods that work in the primal. 

In contrast. Figure [8] illustrates the situation where DAL is not very efficient compared 
to other algorithms. In Figure[8l we can also see that for all algorithms except L1_L0GREG, 
the cost of solving one minimization problem grows larger as the regularization constant is 
reduced, whereas the cost seems almost constant for L1_L0GREG. 



8. Conclusion 

In this paper, we have extended DAL algorithm (jTomioka and Sugivamal . I2OO9I ) for general 
regularized minimization proble ms, and pro v ided it with a new view based on the proxi- 
mal minimization framework in Rockafellar (|l976bl l. Generalizing the recent result from 
Beck and Teboullel (|2009l ^. we improved the convergence results on super-linear convergence 



of augmented Lagrangian methods in literature for the case of sparse estimation. 

Importantly, most assumptions that we made in our analysis can be checked independent 
of data. Instead of assuming that the problem is strongly convex we assume that the loss 
function has a Lipschitz continuous gradient, which can be checked before receiving data. 
Another assumption we have made is that the proximation with respect to the regularizer 
can be computed analytically, which can also be checked without looking at data. Moreover, 
we have shown that such assumption is valid for the ^i-regularizer, group lasso regularizer, 
and any other support function of some convex set for which the projection onto the set 
can be analytically obtained. 



Compared to the general result in iRockafellari (Il976b), our result is stronger w hen the 



inner minimization is solved approximately. Compared to 



Kort and Bertsekas ( 19761 ). we do 
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Table 2: Results on benchmark datasets. We tested six algorithms, namely, DAL, DAL-B, 
FISTA, OWLQN, SpaRSA, L1_L0GREG on seven benchmark datasets. See main 
text for the description of the datasets. m is the number of observations, n is the 
number of features. For each algorithm, shown are the test accuracy and the CPU 
time spent to compute the regularization path with a warm-start strategy. All the 
numbers are averaged over 10 runs. Bold face numbers indicate the fastest CPU 
time. Italic numbers indicate CPU times that are within two times of the fastest 

CPU time. 





arcene 


dexter 


dorothea 


gisette 


madelon 


20news 


gene 


m 


133 


400 


767 


4667 


1733 


1179 


35 


n 


10000 


20000 


100000 


5000 


500 


61188 


62195 


format 


dense 


sparse 


sparse 


dense 


dense 


sparse 


dense 


DAL 


accuracy 


70.60 


91.75 


93.71 


97.70 


61.53 


92.84 


82.35 


time (s) 


3.47 


4.20 


36.61 


11.02 


16.73 


28.10 


5.56 


DAL-B 


accuracy 


72.54 


92.00 


93.05 


97.71 


61.43 


92.87 


81.18 


time (s) 


3.56 


111 


10.60 


82.96 


17.96 


26.31 


5.49 


FISTA 


accuracy 


70.60 


91.75 


93.79 


97.71 


61.51 


92.80 


82.35 


time (s) 


25.34 


7.24 


284.59 


52.19 


10.40 


2195 


108.27 


OWLQN 


accuracy 


70.60 


91.75 


93.76 


97.70 


61.56 


92.82 


82.35 


time (s) 


17.63 


5.25 


134.31 


10.96 


19.08 


23.11 


132.21 


SpaRSA 


accuracy 


70.90 


91.75 


93.71 


97.70 


61.55 


95.14 


78.24 


time (s) 


294.80 


29.98 


1377.20 


91.65 


10.11 


310.96 


1622.26 


Ll_LOG- 
REG 


accuracy 


72.84 


92.05 


93.05 


97.71 


61.48 


92.85 


81.18 


time (s) 


8.98 


3.39 


109.92 


98.31 


5.90 


21.48 


16.58 



not need to assume the strong convexity of the objective function, which is obviously violated 
for the dual of many sparsity regularized estimation problems; instead we assume that the 
loss f unction has Lipschit z con tinuous gradient. No t e tha t we use no asymptotic arguments 



as m 



Rockafellarl (|l976bl ) and iKort and BertsekasI (|l976l l. Currently, our results does not 



apply to primal-based augmented Lagrangian method discussed in iGoldstein and Osher 
( 20081 ) for loss functions that are not strongly convex (e.g., logistic loss). The extension of 
our analysis to these methods is a future work. 

The theoretically predicted rapid convergence of DAL algorithm is also empirically con- 
firmed in simulated £i-regularized logistic regression problems. Moreover, we have compared 
six recently proposed algorithms for £i-regularized logistic regression, namely DAL, FISTA, 
OWLQN, SpaRSA, L1_L0GREG, and IRS on synthetic and benchmark datasets. On the 
synthetic datasets, we have shown that DAL has the mildest dependence on the number 
of parameters among the methods compared. On the benchmark datasets, we have shown 
that DAL is the fastest among the methods compared when the number of parameters is 
larger than the number of observations on both sparse and dense datasets. 

Furthermore, we have empirically investigated the relationship between the choice of 
the initial proximity parameter rjQ and the number of (inner /outer) iterations as well as the 
computation time. We found that the computation can be sped up by choosing a large value 
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Figure 7: Dorothea dataset (m = 800, n = 100,000). DAL is efficient in this case (m <^ n). 
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Figure 8: Madelon dataset (m = 2,000, n = 500). DAL is not very efficient in this case 
(m ^> n). 



for ryo; however the improvement is often smah compared to the change in rjQ and choosing 
large value for rjQ can make the inner minimization unstable by making the problem poorly 
conditioned. 

There are basically two strategies to make an efficient optimization algorithm. One 
is to use many iterations that are very light. FISTA, SpaRSA, and OWLQN (and also 
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stochastic approaches (jShalev-Shwartz and Srebrd . l2008l : iDuchi and Singer , mm) fall into 
this category. Theoretical convergence guarantee is often weak for these methods, e.g., 
0{l/k'^) for FISTA. Another strategy is to use a small number of heavier iterations. Interior 
point methods, such as L1_L0GREG, are prominent examples of this class. DAL can be 
considered as a member of the second class. We have theoretically and empirically shown 
that DAL requires a small number of outer iterations. At the same time, DAL inherits good 
properties of iterative shrinkage/thresholding algorithms from the first class. For example, it 
effectively uses the fact that the proximal operation can be computed analytically, and it can 
maintain the sparsity of the parameter vector during optimization. Furthermore, we have 
shown that the dual formulation of DAL makes the inner minimization efficient, because 

(i) typically the number of observations m is smaller than the number of parameters n, and 

(ii) the gradient and Hessian of the inner objective can be computed efficiently for sparse 
estimation problems. 

Future work inc ludes the extensi o n of our analysis to the primal-based augment e d La- 



grangian methods (|Yin et al.l . I2OO8I : lOoldstein and Osherl . I2OO8I : lYang and Zhand . I2OO9I : 



Lin et al.l . boOflh . application of approximate augmented Lagrangian me t hods and opera- 



see 



tor splitting methods to machine learning problems 
plication of DAL to more advanced sparse estirn ation problems 
Wipf and NagaraianI toO^ : IXomioka et ID ^201(h ). 



Tomioka et al. (l201lh'). a nd a 



[e.g. 



Cai et all ( 200, 
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Appendix A. Preliminaries on proximal operation 

This section con tains sorne basi c results on proxima l oper ation, which we use in l ater s ections 
and is based on Moreau ( 1965 ). Rockafellar ( 197d ). and Combettes and Waid (|2005l ). 



A.l Proximal operation 

Let / be a closed proper convex function over M" that takes values in M U {+00}. The 
proximal operator with respect to / is defined as follows: 



proxJz) = argmin ( f{x) H — ||a; — ) . 



(A.l) 



Note that because of the strong convexity of the minimand in the right-hand side, the above 
minimizer is unique. Similarly we define the proximal operator with respect to the convex 
conjugate function /* of / as follows: 



proxp(z) = argmin ( f*{x) H — \\x — z|p ] . 



The following elegant result is well known. 
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Lemma 8 (Moreau's decomposition) The proximation of a vector z £ with respect 
to a convex function f and that with respect to its convex conjugate f* is complementary in 
the following sense: 

proxj(jz) + proxj, (2;) = z. 
Proof The proof can be found in iMoreaul d 19651 ) and iRockafella^ (|l970l . Thm. 31.5). Here, 



we present a slightly more simple proof. 

Let X = proxj{z) and y = proxj. (z). By definition we have df{x) + x — z 3 and 
df*{y) + y — z 3 0, which imply 



and 



df{x) 3 z — X, 
df*{z — x) 3 X, 



df*{y)3z-y, 
df{z -y)3y, 



respectively, because (a/)-i = df* (|Rockafellail .fl970l. Cor. 23.5.1) 
From Eqs. ()A.2aP and ()A.3ap . we have 

fiz -y)> fix) + {z-y- x)'^{z - x), 
f*{z -x)> f*{y) + {z-x- y)^{z - y). 

Similarly, Eqs. (IX2bl) and ([Ob]l give 

f*{y) > f*{z - x) + {y - z + x)^ X, 
fix) > fiz-y) + ix-z + y)^y. 



bummme both sides of Eqs. ^KM - ^Kl) . we have 

> 2\\z -X- yf, 
from which we conclude that x + y = z. 



(A.2a) 
(A.2b) 



(A.3a) 
(A.3b) 



(A.4) 
(A.5) 



(A.6) 
(A.7) 



Proximal operation can be considered as a generalization of the projection onto a convex 
set. For example, if we take / as the indicator function of the £00 ball of radius A, i.e., 
fiz) = S'^iz) (see Eq. ([5])), then the proximal operation with respect to / is the projection 
onto the £oo-ball ([8]). On the other hand, the proximal operation with respect to the ii- 
regularizer is the soft-thresholding operator ([9]). Therefore, we notice that 

proj[_A,A](2:) + prox^'(z) = z, 

which is a special case of Lemma El because the £i-regularizer is the convex conjugate of 
the indicator function of the £oo-ball of radius A; see Figure [9l 
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A. 2 Moreau's envelope 

The minimum attained in Eq. (jA.ip is called the Moreau envelope of /: 

F(z) = min ( f(x) H — lla; — z\ 



The decomposition in Lemma [8] can be expressed in terms of envelope functions as 
follows. 

Lemma 9 (Decomposition and envelope functions) Let f and f* be a pair of convex 
conjugate functions, and let F and F* be the Moreau envelopes of f and /*, respectively. 
Then we have 

F{z) + F*{z) = \\\zf. 

Proof Let x = proxj(2;) and y = proxj*(z) as in the proof of Lemma [8l From the 
definition of convex conjugate /*, we have 

fix) + f*{y) = y^x, 

because y = z — x ^ df{x) (jRockafellarl . Il97d . Theorem 23.5). Therefore, we have 



F{z) + F*{z) = fix) + ]^\\yf + f*{y) + ]^\\xf 

T ^ II ii2 II ii2 

= V X -\ — -y H — \\x\\ 
y -r ^\\y\\ -r ^11 II 



1, 



\x + y\ 



1, 



2" 2" " ' 

where we used x + y = z from Lemma [5] in the last line. ■ 

Note that F* is the Moreau envelope of /* and not the convex co njugate of .F . 

Moreau's envelope can be considered as a inf-convolution (see Rockafellar ( 197d )) of / 
and a quadratic function || • P/2. Accordingly it is differentiable and the derivative is given 
in the following lemma. 



Lemma 10 (Derivative of Moreau's envelope) Moreau's envelope function F in Eq. (jX7 
is continuously differentiable (even if f is not differentiable) and the derivative can be writ- 
ten as follows: 

VF{z) = pioXf,{z). 



Proof The proof can be found in iMoreaul (|l965l ) and iRockafellaii (Il97d . Thm. 3L5). We 
repeat the proof below for completeness. The proof consists of two parts. We first show 
that for all z,z' e 



Fiz') > F{z) + [z' - zfy, 



(A.9) 



39 



ToMiOKA, Suzuki, and Sugiyama 




-I I -10 1 



P^°j[-x,;.](^) (d) pro)^Xw) 

Figure 9: (a) The £i-regularizer (dashed) and its envelope function (sohd). (b) The 
indicator function 6'^ (dashed) and its envelope function $^ (solid), (c) The 
derivative of ^a, which is the projection onto the interval [—A, A]; see Eq. ([8]). (d) 
The derivative of <I>^, which is called the soft-threshold function Q. Note that 
the £i-regularizer and the indicator function 5^ are conjugate to each other. 



where y = proxj*(z), which implies that y = proxj.(z) S dF{z). Second, we show that 

F{z') < F{z) + (z^ - z^y + ~ , (A.IO) 



which implies the uniqueness of the subgradient of -F(2;) for all z. 

Inequality ()A.9p follows easily from the definition of the envelope function F and LemmaO 
as follows: 

F{z') - F{z) = fix') + ^-\\y'f - fix) - ^\\yf 

fix')-fix)) + [\\\yr-l\\yf 
> ix' - x)^y + iy' - y)^y 
= iz' - z)^y, 

where x = proxj(z), y = proxj*(z), and x' and y' are similarly defined. We used the 
convexity of / with y £ dfix) and the convexity of || • |p/2 in the third line. 
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Second, we obtain inequality (jA.lOp by upper-bounding F{z') as follows: 
F(z') = minf/(^) + -||^-z'| 



<f(x) + ^\\x-z'f 

= F(z) - -\\x - zf + -\\x- z'f 
\ J 2" "2" " 

= F(z) - -\\x - zP + - \\x - z + z - z'P 
= F{z) + {x- zfiz - z') + ^\\z' - zf 
= F{z) + y^{z' -z) + ]^\\z' -z\\\ 



The envelope functions of two convex functions (t)\{w) = A|7u| and </>^(w) = 6'^{w), and 
their derivatives (the projection and the soft-threshold function ([9]), respectively) are 
schematically shown in Figure O 

Appendix B. Derivation of Equation ( l20l) 



Eq. (UHl) = max < -fpi-a) + min [ (b\{w) + — \\w - - mA^cxW^ 



— \\w' + r]tA'cx\\') + 



max ( -fl{-a) + —^xr,t{w* + VtA~^ct) - i^W + TltA'^ctf^ + 
ogr™ V Vt 2r?t J 2r]t 



t||2 



max ( -/;(-«) - -^InA^' + r^^A a) J + , 



where we used the definition of the Moreau envelope in the second line and Lemma [9] in the 
third line. Finally omitting the constant term ||i(;*|p/(2?7() in the last line and reversing the 
sign we obtain Eq. (I20p . 



Appendix C. Proofs 
C.l Proof of Theorem [3] 

Proof The first step of the proof is to generalize Lemma [1] in two ways: first we allow 
a point wl in the set of minimizers W* to be chosen for each time step, and second, we 
introduce a parameter /i to tighten the bound. Let be the closest point from in W*, 
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namely := argmin^.gp^/, — ^f*!!- From the proof of Lemma [H we have 



> - W*f - - T^llllw* - W*\\ 



-II II " V 

TI/*I|2 1 ||„„i w*l|2 



(V/i > 0) 



1 _ ^ ) \\w'+^ _ 1^*11^ _ —\\w' _ w*\\', W 



where the last inner product (^w^^^ — w^^^ ,w*' — w^'^^'^ in the third line is non-negative 
because the set of n iinimi zers W* is a convex set, and w*~^^ is the projection of w^~^^ onto 
W*: see iBertsekad (|l999l . Proposition B.ll). In addition, the fifth line follows from the 



inequality of arithmetic and geometric means. 

Note that by setting = 1 in (*) and = w^~^^, we recover Lemma [TJ Now using 
assumption (Al), we obtain the following expression: 

(2/i - ^u^) \\w^+^ - W*f + 2fiar]t\\w^+^ - W*^ < Ik* - W*f. 

Maximizing the left hand side with respect to ^, we have /x = 1 + ar]t\\w^~^^ — and 
accordingly, 

Taking the square-root of both sides we obtain 

-W*\\+ ar]t\\w*+^ - W*r~^ < - W*\\. (C.l) 

The last part of the theorem is obtained by lower-bounding the left-hand side of the above 
inequality using Young's inequality as follows: 

-W*\\+ ai]t\\w^^^ - W*r-^ 

= (1 + aTjt) ( -^—\\w'+' - W*\\ + -^^\\w'+' - W*r-' 
\l + (Jr]t 1 + ar]t 

1 {a — l)(T77f 

> {1 + ar]t)\\w^'^'^ - W*\\^+^t . - W*\\ 1+-"* 

l + (a-l)o-T)f 

= {l + a'nt)\\w^^^ -W*\\ . 
Substituting this relation back into inequality (jC.lh completes the proof of the theorem. 



C.2 Proof of Lemma |4] 

Proof First let us define 5* G as the gradient of the AL function ()22p at the approximate 
minimizer a* follows: 

6' := V(^t(a*) = -V/;(-«*) + Aw'+\ 
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where w^^^ := prox^^^ (id* + rjfA^cx''). Note that ||5*|| < y'^ll'"^*^"'^ — w^\\ from assump- 
tion (A4). Using Cor. 23.5.1 from iRockafellai] (|l97d l. we have 

V/KAiD*+i - 6') = Vfe (V/;(-a*)) = -a\ (C.2) 

which imphes that if 5* is smah, — q* is approximately the gradient of the loss term at w''^^. 

Moreover, let q* = tu* + rjtA^ot''. Since tu*"^^ = prox^^^ (q*) (assumption (A3)), we 
have 

9</)A^,(t(;*+i) + (tD*+i-q*)9 0, 

which implies 

{q'-w'+')/r]t£dMw'+^), (C.3) 

because (pxr^^ = r]t4>x- 

Now we are ready to derive an analogue of inequahty ()39p in the proof of Lemma [TJ Let 
w G M" be an arbitrary vector. We can decompose the residual value in the left hand side 
of inequality ([39]) as follows: 

^ V ' 

(A) 

^ V ' 

(B) 

^ V ' 

(C) 

The above terms (A), (B), and (C) can be separately bounded using the convexity of fe 
and (/)x as follows: 

(A) : feiAw) - h{Aw'+' - S') > {A{w - w'+') + S\ -ex') , 

(B) : h{Aw'+^ - 5') - MAw'+') > - {S\ - , 

(C) : 4'x{w)-4>x(w'^^)>lw-w'^^, 



where (A) is due to Eq. (IC.2|1 . (B) is due to assumption (A2) and Hiriart-Urruty and Lemarechall 
(I1993I . Theorem X.4.2.2), and (C) is due to Eq. (iOSll . Combining (A), (B), and (C), we 
have the following expression: 

r?i(/(^) - /(«;*+!)) > {w' - w'+\w- w'+') - |i||<5*f . 

Note that the above inequality reduces to Eq. (j39p if \\St\\ = (exact minimization). Using 
assumption (A4), we obtain 

rjtifiw) - /(u;*+^)) > {w'^ -w + w- w'+^,w - w'+^) - ^\\w' - w^+^f 

1 II ii2 1 II / ii2 

= -\\W - W^^^lr \\W - If , 

2" " 2" " ' 
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which completes the proof. 



C.3 Proof of Theorem [6] 

Proof Let 5 = {1 — e)/{ar]t). Note that 5 < 3/4 < 1 from the assumption. Following the 
proof of Lemma H] with w = w*', we have 

Vtifiw'^')-fiw'+')) 



2' 



= -\\w^ — — -llw* — + - — — w^W'^ 

2" " 2" " 2 " " 



> h\w'+i -w*f --\\w' -W*f 

> -(1 - 6)\\w^+^ - W*\\\\w^ - W*\\ + (l - Ik*^^ - W*f - ^\\w^ - W*f 

> _(i -s)( J-w^t _ w*f + ^\\w'+^ - w*f 

+ (^1 - - - ^\\w' - W*f (V/^ > 0), 

where we used w*!!^ > 0, — w^^^,w'^~^^ — w^) > and (^w^~^^ — to*, to* — lo*) > 

in the sixth line; the seventh line follows from Cauchy-Schwartz inequality; the eighth line 
follows from the inequality of arithmetic and geometric means. 

Applying assumption (Al) with a = 2 to the above expression, we have 

^\\w' - W*f > -^f,\\w'+^ - W*f + ((1 - I) + ar]t) \\w'+^ - W*f - -\\w' - W*f. 

Multiplying both sides with — VF*|p, we have 

1-5 \\w^-W*\\^ 1-6 ^ (f d\ 6 \\w^-W*P ] , 



t+i_\Y*p- 2 I V 2/ ' 2\\wi+^-W*P 



z WW- 

Now we have to consider two cases depending on the sign inside the curly brackets. If the 
sign is negative or zero, we have 

, 6 6 \\w^-W*P 
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which impUes 

\\w*+^-W*f< \\w^-W*f. (C.5) 

Since 6 < 3/4, the factor in front of \\w^ — W*\\'^ in the right-hand side is clearly smaller 
than one. We further show that this factor is smaller than 1/(1 + eai]t)^. First we upper 
bound (5 by 1/(1 + earjt) as follows: 

1-e _ + ^ (l-e)e + 3/4 ^ 1 



ar]t 1 + earit 1 + eorn 1 + ea-qt 

Plugging the above upper bound into inequality ()C.5p . we have 

W^t+l _ ^*||2 < i ii^t _ ^*||2 

" - l + 2fTr?t" 

- (l + ear?t)(l + 2ar/t)" " " (1 + ea??t)2 " 

which completes the proof for the first case. 

If on the other hand, the term inside curly brackets is positive in Eq. ()C.4p . maximizing 
the right-hand side of Eq. ()C.4p with respect to ^ gives the following expression: 



(1 - 5)rt > 1 - ^ + o-r?t - ^r^, 



where we defined rt := — Vl^*||/||w*''~^ — W*\\. Because rt > 0, the above inequality 
translates into 



^ VI + 2ar]t6 -1 + 5 

n > 



> 



6 

1 + ar]t6 - (j'^rilS'^ -1 + 5 



5 

> 1 + ar]t{l - arit5) 
= 1 + earit. 

The second line is true because for x > 0, yj\ + x > 1 + x/2 — x^/4; the last line follows 
from the definition 5 = {1 — e)/{arjt). ■ 



C.4 Proof of Theorem [7] 

Proof Since the minimizer is unique, we denote the minimizer by w* and show that the 
following is true: 

f{w) — f{w*) > a\\w — (Vtu : \\w — w*]] < ctL). 
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Using Theorem X.4.2.2 in Hiriart-Urrutv and Lemarechal ( 19931 ). for ||/3|| < r, we have 



n(3)<no)+p'^vno) + ^\\f3f 

= -fiw*) + (3'^w* + ^\\Pf, 



where w* := argmin^gj^n f{w) = V/*(0) and /*(0) 
On the other hand, we have 



f{w)= sup {(3'w-r{(3) 

/3eK" ^ 

> sup (f3'^w-r{f3) 



., \\<r 

> sup ( (3^ {w 

\\P\\<T V 



w 



ii/3in+/K 



2L I 
2c-l I 



>f{w*) + ^\\w-w 



.*||2 
,*||2 



(ifc<l), 
(otherwise) , 



(C.6) 



where we used inequahty ()C.6P in the third hne; the last hne fohows because if c < 1, the 
maximum is attained at (3 = {w — w*)/L, and otherwise we can lower bound the value at 
the maximum by the value at /3 = (tw — w*) /{cL). Combining the above two cases, we have 
Theorem [71 ■ 
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Table 3: List of loss functions and their convex conjugates. Constant terms are ignored. For the multi-class logit loss, z,q: G 
]^m(c-i)^ where m is the number of samples and c is the number of classes; yik = 1 if the ith sample belongs to the 
kth class, and zero otherwise; i{k) := (i — l)c + k denotes the linear index corresponding to the kth output for the ith 
sample; 6i k denotes the Kronecker delta function. 



primal loss fe{z) 



conjugate loss fi(—a) 



gradient — V/^*( — a) 



Hessian V'^//(— a) 



1. ■sp" (j/j-Zj)^ " 
2 Z^i=l ?2 



Squared loss 



1 v^n 2 / \2 



diag (cr?, . . . ,(T,i) (a - y) 



diag (. 



+ {1 ~ a^yi) logjl ~ a^yi)) 



Logistic loss 



log 



^ELi((i-a>)iog(i-«0 

+ (1 + Ui) log(l + ai) - 2Qiy,) 



a 

> 

> 
o 



Hyperbolic 
secan t likeli- 
hood (|Haufe et 
I2OIOI ) 



1 1 l + 



diag( 



2(1-Qi)(l+Qi) 



a], 



Z 
H 
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> 
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o 
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fs 
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H 
w 

Ifi 
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3 

H 
> 



E ( E {Vik - ai(fc)) log(yife - ai(fc)) 



Multi- 
class logit 



Em / ^— \c — 1 

+ log(E^lie^^('»)+l)) 



c-l 



c-1 



+ E ai(fc)) log(y«c 4- E ai(fe))) 

fc=i 
< 1 



( Tomioka and M illeii. 



I2OIOI ) 



fc=i 

(0 < yik - a^k) 
(fc = l,...,c-l), 



(i = 1, . . . ,m; 
it = l,...,c- 1) 



+ 



Vik-otiik) 



m(c— 1) 



B.c+E^/^i a,(fc') /i(fe),j{0=i 
(i, j = 1,. . . ,m; 
k,l — 1, . . . , c — 1) 



Table 4: List of regularizers and their corresponding proximity operators (jl5p and the Envelope function (j2ip . The operation 
(•)+ is defined as := max(0, a;) and applies element-wise to a matrix. 



Description 



Regularizer 



Proximity operator prox;^ 



Envelope function $^ 



-regularizer 



( Tibshirani 119961 ) 



prox^i(-u;) = ({\wj \ - X)+^ 



^a(H = ^E"=i(K|-A)5 



Group 



lasso 



(jYuan and Lin 
2006h 
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«&I(^) = ^E0e«(ll^0l|-A)i 



{UiS-X)+V^ 



Trace 



norm 



(Fazel et al.. 2001 



M = AE-=i^.W 



prox^^"^ {w 
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Srebro et all |2005[) 



Elastic-net 



W = AE"=i((i-^)KI + H 



prox^"(it)) 



(|Zou and Hastie 
20051: 
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